ALGORITHMS FOR LINEAR AND NONLINEAR APPROXIMATION OF LARGE DATA Boulos Harb A DISSERTATION

in

Computer and Information Science Presented to the Faculties of the University of Pennsylvania in Partial Fulfillment of the Requirements for the Degree of Doctor of Philosophy

2007

Sampath Kannan Supervisor of Dissertation

Sudipto Guha Supervisor of Dissertation

Rajeev Alur Graduate Group Chairperson

COPYRIGHT Boulos Harb 2007

‫‪To my Mother‬‬

‫إﻟﻰ واﻟﺪﺗﻲ اﻟﻌﺰﻳﺰة‬

‫‪iii‬‬

Acknowledgements I am deeply indebted to my advisors Sampath Kannan and Sudipto Guha. No words I write here can express my gratitude for their outstanding mentorship and all that they have done for me. My advisors believed in me and supported me though the ups and downs of my life as a graduate student and young researcher. From the moment Sampath welcomed me into the Algorithms group, he encouraged me to work and collaborate with researchers in different fields and a variety of interests. He exemplified this philosophy of outreach and collaboration. Sudipto introduced me to the wonderful world of wavelets and taught me everything I know about them. While surfing the sea of wavelets late at night, Sudipto showed me there is no ‘special’ time to do research. My advisors provided me with two great, if challenging, examples to follow. I presented my initial results on sparse representations using wavelets at KDD05 where I also met Michael Mahoney and Petros Drineas for the first time. I had the pleasure of extending this meeting into a fruitful collaboration during my internship at Yahoo! Research in the Summer of 2006. Michael’s guidance and attention to detail kept my spirits both high and grounded at the same time. I was able to use the work I did at Yahoo! in my dissertation, but it would not have existed without Michael, Petros, Ravi Kumar, and Anirban Dasgupta. Anirban and I burnt the midnight oil working out the details of the result and piling through related works. His creativity and thoroughness made the project a pleasant and highly rewarding experience. I am also grateful to Michael for inviting me to Dagstuhl in Germany iv

where I had a chance to present my work on wavelets to a larger audience; and I wish to thank Petros for all his wholehearted encouragement and advice. To my Ph.D. Committee I express my sincere thanks. Jean Gallier, Ravi Kumar, Fernando Pereira and the Committee Chair, Sanjeev Khanna, took the time from their extremely busy schedules to read the manuscript and provided me with thoughtful and constructive comments. They also provided me with numerous stimulating questions and directions for future research. Jean took time off from his Sabbatical, and Ravi flew coast-to-coast to attend the defense; while Fernando challenged me and kept me focused. Sanjeev also chaired my committee for the qualifying examination where he encouraged me to take on and study new topics regardless of how daunting they may seem. My research would be nowhere without my collaborators. It has been a humbling experience working with Andrew McGregor, Anirban, Junhyong Kim, Li-San Wang, Michael, Petros, Ravi, Sampath, Sanjeev, Stanislav Angelov, Sudipto and Vanja Josifovski. From each I learned something different, and I hope I will get the chance to work again with them in the future. During my graduate studies, I had the pleasure of working with Junhyong on problems in computational biology. Junhyong’s uncanny knowledge in both Biology and Computer Science proved invaluable in formulating a wealth of problems and in inspiring us to take them on. During this time I worked very closely with Stan, whose perseverance helped us carry through and produce results. Stan also meticulously reviewed my introduction; and he and Lokman Tsui sat through practice runs of my defense. The “theory office”: Andrew, Stan, and Sid Suri, were always there to answer questions and help with all the ‘small’ problems I ran into during my research. Long before that, my high-school teachers Samer Khouri and Khaldoun Haddadin in Amman taught me my first pure math classes and paved the road to my current studies. This dissertation would not have been possible without the help of my family. v

My Uncle Sami was unfailingly available to give me advice. My mother supported me in innumerable ways. She forever told me that one day I will get a Ph.D. and always ‘reminded’ me that I am capable. I wish to thank her and my siblings for their endless encouragement during all my, what seemed until now, interminable studies. If I thought I studied for a long time, my sister Rula studied even longer and exemplified hard work to me. Tania provided the necessary push and realitycheck whenever I needed it; while my brother Tariq’s music made the writing and the long hours more pleasant. Finally, readers should thank Sayumi Takahashi who, humorously and tactfully, suggested alterations to my convoluted sentences and corrected my grammar in the dissertation. All remaining errors are mine. I thank her for being there for me even during her busiest times and for showing me a poetic outlook on life, one that exists even when it comes to wavelets.

vi

ABSTRACT ALGORITHMS FOR LINEAR AND NONLINEAR APPROXIMATION OF LARGE DATA Boulos Harb Sampath Kannan and Sudipto Guha A central problem in approximation theory is the concise representation of functions. Given a function or signal described as a vector in high-dimensional space, the goal is to represent it as closely as possible using a linear combination of a small number of (simpler) vectors belonging to a pre-defined dictionary. We develop approximation algorithms for this sparse representation problem under two principal approaches known as linear and nonlinear approximation. The linear approach is equivalent to over-constrained regression. Given f ∈ Rn , an n × B matrix A, and a p-norm, the objective is to find x ∈ RB minimizing kAx − f kp . We assume that B is much smaller than n; hence, the resulting problem is over-constrained. The nonlinear approach offers an extra degree of freedom; it allows us to choose the B representation vectors from a larger set. Assuming A ∈ Rn×m describes the dictionary, here we seek x ∈ Rm with B non-zero components that minimizes kAx − f kp . By providing a fast, greedy one-pass streaming algorithm, we show that the solution to a prevalent restricted version of the problem of nonlinear approximation using a compactly-supported wavelet basis is a O(log n)-approximation to the optimal (unrestricted) solution for all p-norms, p ∈ [1, ∞]. For the important case of the Haar wavelet basis, we detail a fully polynomial-time approximation scheme for all p ∈ [1, ∞] based on a one-pass dynamic programming algorithm that, for p > 1, is also streaming. Under other compactly-supported wavelets, a similar algorithm modified for the given wavelet basis yields a QPTAS. Our algorithms extend to variants of the problem such as adaptive quantization and best-basis selection. For linear over-constrained `p regression, we demonstrate the existence of coresets and present an efficient sampling-based approximation algorithm that computes vii

them for all p ∈ [1, ∞). That is, our algorithm samples a small (independent of n) number of constraints (rows of A and the corresponding elements of f ), then solves an `p regression problem on only these constraints producing a solution that yields a (1 + )-approximation to the original problem. Our algorithm extends to more general and commonly encountered settings such as weighted p-norms, generalized p-norms, and solutions restricted to a convex space.

viii

Contents Abstract

vii

1 Introduction 1.1

1

Roadmap and Summary of Results . . . . . . . . . . . . . . . . . . .

2 Greedy Algorithms for Wavelet Sparse `p Regression 2.1

13 19

Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

19

2.1.1

The Sparse Wavelet Representation Problem . . . . . . . . . .

19

2.1.2

Adaptive Quantization . . . . . . . . . . . . . . . . . . . . . .

21

2.2

Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

22

2.3

Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

2.3.1

Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . .

24

2.3.2

Compactly Supported Wavelets . . . . . . . . . . . . . . . . .

24

2.4

2.5

Greedy Approximation Algorithms for General Compact Systems and Data Streams . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

33

2.4.1

An `∞ Algorithm and Analysis . . . . . . . . . . . . . . . . .

33

2.4.2

An `p Algorithm and Analysis for p ∈ [1, ∞) . . . . . . . . . .

36

2.4.3

Summary and Examples . . . . . . . . . . . . . . . . . . . . .

37

Adaptive Quantization . . . . . . . . . . . . . . . . . . . . . . . . . .

38

2.5.1

Spectrum Representations . . . . . . . . . . . . . . . . . . . .

40

2.5.2

Bit Complexity Representations . . . . . . . . . . . . . . . . .

41

ix

2.5.3 2.6

Multiplane Representations . . . . . . . . . . . . . . . . . . .

Using the Greedy Algorithm for Compressing Images under non-`2 Error Measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3 Approximation Schemes for Haar Sparse `p Regression 3.1

41

43 48

Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

48

3.1.1

The Sparse Representation Problem using the Haar Wavelet .

48

3.1.2

Sparse Representation with Best Basis Selection . . . . . . . .

50

3.1.3

Representations for Data Streams . . . . . . . . . . . . . . . .

51

3.2

Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

53

3.3

A Streaming (1 + ) Approximation for the Haar Wavelet . . . . . . .

54

3.3.1

The Algorithm: A Simple Version . . . . . . . . . . . . . . . .

57

3.3.2

Analysis of the Simple Algorithm . . . . . . . . . . . . . . . .

63

3.3.3

An Improved Algorithm and Analysis . . . . . . . . . . . . . .

64

Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

66

3.4.1

PTAS for multi-dimensional Haar Systems . . . . . . . . . . .

66

3.4.2

QPTAS for General Compact Systems . . . . . . . . . . . . .

68

3.4.3

Workloads . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

70

3.4.4

Quality Versus Time . . . . . . . . . . . . . . . . . . . . . . .

71

3.5

Best Basis Selection from a Dictionary . . . . . . . . . . . . . . . . .

73

3.6

Comparing the Restricted and Unrestricted optimizations . . . . . . .

75

3.6.1

The algorithms . . . . . . . . . . . . . . . . . . . . . . . . . .

76

3.6.2

The Data Sets . . . . . . . . . . . . . . . . . . . . . . . . . . .

76

3.6.3

Quality of Synopsis . . . . . . . . . . . . . . . . . . . . . . . .

78

3.6.4

Running Times . . . . . . . . . . . . . . . . . . . . . . . . . .

78

3.4

4 Sampling Algorithms and Coresets for `p Regression 4.1

81

Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

81

4.1.1

83

Related work . . . . . . . . . . . . . . . . . . . . . . . . . . . x

4.2

Our contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

84

4.3

Preliminaries . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

86

4.4

Main technical ingredients . . . . . . . . . . . . . . . . . . . . . . . .

87

4.4.1

Well-conditioned bases . . . . . . . . . . . . . . . . . . . . . .

87

4.4.2

Subspace-preserving sampling . . . . . . . . . . . . . . . . . .

90

The sampling algorithm . . . . . . . . . . . . . . . . . . . . . . . . .

96

4.5.1

Statement of our main algorithm and theorem . . . . . . . . .

96

4.5.2

The first stage of sampling: a constant-factor approximation .

99

4.5.3

The second stage of sampling: a relative-error approximation . 101

4.5

4.6

Extensions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 107

5 Conclusions and Open Problems

110

5.1

Summary of Contributions . . . . . . . . . . . . . . . . . . . . . . . . 111

5.2

Future Research . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112

xi

List of Figures 1.1

Image compression using wavelets . . . . . . . . . . . . . . . . . . . .

8

1.2

Organization of and problems addressed in dissertation . . . . . . . .

14

2.1

The scaling and wavelet vectors of the Daubechies D2 wavelet . . . .

31

2.2

The Haar basis vectors for R8 . . . . . . . . . . . . . . . . . . . . . .

32

2.3

The coefficient graph of D2 . . . . . . . . . . . . . . . . . . . . . . . .

33

2.4

Pseudocode of the greedy algorithm’s implementation . . . . . . . . .

43

2.5

Example image representation with the Haar wavelet using the optimal `2 strategy and our greedy approximation algorithm under the `∞ and `1 error measures . . . . . . . . . . . . . . . . . . . . . . . . .

2.6

46

Example image representation with the D2 wavelet using the optimal `2 strategy and our greedy approximation algorithm under the `∞ and `1 error measures . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

47

3.1

An illustration of the Haar dynamic programming procedure . . . . .

60

3.2

The Haar streaming FPTAS for `∞ . . . . . . . . . . . . . . . . . . .

61

3.3

The support regions and coefficient tree structure of the 2-dimensional Haar basis vectors for a 4 × 4 data array. . . . . . . . . . . . . . . . .

66

3.4

An example D2 QPTAS subproblem

. . . . . . . . . . . . . . . . . .

70

3.5

The Saw dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

77

3.6

The DJIA dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . . .

77

xii

3.7

The `∞ error of the three algorithms, UNREST, REST, and HYBRID for the Saw dataset . . . . . . . . . . . . . . . . . . . . . . . . . . . .

3.8

The `∞ error of the three algorithms, UNREST, REST, and HYBRID for the Dow data set . . . . . . . . . . . . . . . . . . . . . . . . . . .

3.9

4.1

78

79

Running times of the three algorithms, UNREST, REST, and HYBRID for prefixes of the Dow data set . . . . . . . . . . . . . . . . .

79

Sampling algorithm for `p regression

97

xiii

. . . . . . . . . . . . . . . . . .

Chapter 1 Introduction A central problem in approximation theory is the concise representation of functions. Given a vector of n observations f1 , . . . , fn , or measurements of a signal f taken at values of an independent variable t where fi = f (ti ) for i = 1, . . . n, the goal is to represent the vector, and thus model the signal, using a linear combination of (simple) functions belonging to a pre-defined dictionary. This is a basic scientific problem with many applications, an early one of which included predicting the orbit of the astroid Ceres in the year 1801 [1]. In this dissertation, we develop approximation algorithms for representing a given high-dimensional vector as closely as possible using a linear combination of a small number of dictionary elements. These sparse representations, so-called because the number of dictionary elements we are constrained to use is much smaller than the dimension n of the target vector, have become more pertinent in light of the large amounts of data that we encounter. The benefits we gain from the sparsity, however, are counteracted by a loss in our representation’s fidelity and its ability to model the signal accurately. A sparse representation will be, in general, an approximation to the given vector, and the quality of this approximation is affected by both the sparsity constraint and the choice of dictionary elements we are permitted to utilize for the representation. There are two main approaches in approximation theory that differ in the latter respect: these are 1

known as (i) linear and (ii) nonlinear approximation. We will present algorithms for each of these approaches. A main goal of our algorithms will be to efficiently compute a sparse representation whose performance provably approximates that of the target vector’s optimal representation under the same sparsity constraint. As indicated above, we will work in the discrete setting, and accordingly, we assume that we are given a vector f in high-dimensional space Rn . We are also given a dictionary of m elements of Rn , {a1 , a2 , · · · , am }, and an integer B which we have referred to earlier as the sparsity constraint; hence, B is typically much smaller than n. The goal is to represent f as a linear combination of B elements of the dictionary. That is, we wish to find a vector fˆ with fˆ =

X

x k ak ,

(1.1)

k∈S : |S|=B

that is a good representation of f . We need a way to measure the success of our candidate representation, and for that we will use its `p distance from the target

vector f : f − fˆ p . Recall that if y ∈ Rn , then   (Pn |y |p )1/p 1 ≤ p < ∞ i=1 i kykp = .  max |y | p=∞ i=1,...,n

i

As a simple example, suppose as before that we are given a set of observations fi = f (ti ) of an underlying signal f taken at times ti , i = 1, . . . , n. Suppose we wish to model the signal f as a linear combination of simpler functions ϕk , k = 1, . . . , B where the performance of our model is measured using the least-squares error. Then we can set ak [i] = ϕk (ti ) for i = 1, . . . , n and k = 1, . . . , B, and our task becomes that P Pn ˆ2 of finding fˆ = B k=1 xk ak that minimizes i=1 |fi − fi | . A few things to note about this example. First, if we set Aik = ak [i], then we are simply solving the familiar least-squares regression problem kf − Axk2 . Second, we designed our dictionary {ak }B k=1 . In general, however, we will assume that the dictionary will be given to us. Finally, we were not asked to choose a set of B vectors for our representation. Instead, we were given the B vectors, and we were allowed to use all of them. This 2

is an important distinction between the two approaches we referred to earlier called linear and nonlinear approximation. A B-term linear approximation of f is specified by a linear combination of a given B elements from the dictionary. In other words, the set S of indices of dictionary elements in expression (1.1) is given to us, and all we need to do is to compute

P ˆ

the xk ’s in fˆ = k∈S xk ak that minimize f − f p for the desired p-norm. For simplicity of notation, we will assume that we are given the first B elements of P the dictionary; i.e., fˆ = B k=1 xk ak . Notice that the vector space from which we are approximating any f ∈ Rn is the linear space F[B] = span{ak , k ∈ [B]}, which explains why this type of approximation is called linear approximation. The error of the best representation of f in this space is given by the `p distance of f from

F[B] : E[B],p = minfˆ∈F[B] f − fˆ p . For example, in the case of the 2-norm, this is a least-squares regression problem whose solution is the projection of f onto F[B] . More generally, the problem can be solved using convex programming methods (see, e.g., [10]). We address the problem of finding the best representation of f in the given linear space through embedding f and the dictionary. Specifically, we ask the question: Can we embed f and the dictionary elements in a low -dimensional space, solve the representation problem in this smaller space (hopefully, asymptotically faster than solving it in the original space), and use our answer to approximate the best possible representation of the original target vector in the original space? The embedding we seek is very special in nature. If f˜ and {˜ ak }B k=1 are the embedded s ˜ versions of f and {ak }B ˜k [j] = ak [ij ] for all j = k=1 in R , then f [j] = f [ij ] and a

1, . . . , s and k = 1, . . . , B. More general versions of the problem that we consider include approximating the representations of multiple target vectors at once using one embedding; and, restricting the solution {xk }B k=1 to a convex space. In nonlinear approximation, neither the set of indices S of dictionary elements P nor their corresponding coefficients in the representation fˆ = k∈S xk ak are given 3

to us. The choice of the B vectors to use in the representation is instance specific, meaning it depends on the target vector f . This type of approximation replaces the linear space F[B] with the space of all vectors that can be represented as a linear combination of any B dictionary elements, FB =

X k∈S

xk ak : xk ∈ R, S ⊂ [n], |S| = B



.

This is a nonlinear space since, in general, the sum of two arbitrary vectors in FB uses more than B elements from the dictionary, and thus does not belong to the space. We measure the error of a candidate representation as before using its `p distance from f , and the error of the best representation of f in FB is given by

EB,p = minfˆ∈FB f − fˆ p . We address this nonlinear approximation problem when the given dictionary is a wavelet basis, and we ask the following question: Given f ∈ Rn , can we find a representation fˆ using B terms from the specified wavelet basis that provably approximates the best B-term representation of f in this basis? Additionally, we wish to be able to compute our representation without having to store the whole target vector f . This is important especially when n is very large, and f , for instance, is a time-series signal (think of the example above, when f is a set of n observations fi = f (ti ), i = 1, . . . , n, taken in succession) . It implies, however, that our algorithms can only use sub-linear space. More general versions of the problem that we also consider include having a bound on the number of bits rather than the number of coefficients; and, having multiple bases in the dictionary among which to choose for the representation. In addressing the problems mentioned above we focus on measuring the error (or goodness) of our representation using general `p norms. These problems have all been studied under the `2 error measure; however, it is not clear what techniques are needed for solving them under other `p norms. Unless a space is equipped with the `2 norm, it ceases to be a Euclidean space; hence, we lose all the convenient properties 4

of the associated inner-product. For example, even the notion of projection is not entirely intuitive under non-`2 norms. In this dissertation, we develop machinery that is useful for these more general norms in the context of approximation algorithms. In fact, under the `2 norm, our techniques coincide with known `2 -specific ones; hence, they extend these existing techniques. Before summarizing our results and giving an overview of the dissertation in Section 1.1, we elaborate on the two main problems that we address providing background and more motivation. We start by casting the linear and nonlinear approaches to approximation in matrix notation as (perhaps more familiar) regression problems.

Regression and linear vs. nonlinear approximation. An equivalent way of viewing linear and nonlinear approximation that may be illuminating is the following. We are given a set of vectors, or a dictionary, {ai }m i=1 and a target vector f . Suppose that f and the dictionary vectors belong to Rn , and let A = [a1 , a2 , · · · , am ] be an n × m matrix whose columns are the dictionary elements. In linear approximation, A is partitioned into [AB Am−B ], where AB is a set of B columns from A, and we are asked to find x ∈ RB that minimizes kAB x − f kp . One can think of AB as A multiplied by an m × B matrix Q with at most one 1 in each column. In this case, we wish to find x ∈ RB that minimizes kAQx − f kp . This is a standard regression problem that can be solved by various methods. We will show how one can reduce the dimension of this regression problem before passing it to a solver of one’s choice. Nonlinear approximation, on the other hand, does not hand us the matrix Q, and instead requires that we find both Q and x that minimize kAQx − f kp . Put another way, nonlinear approximation asks that we find x ∈ Rn that optimizes minkxk0 ≤B kAx − f kp where kxk0 counts the number of nonzero coefficients in the vector x. The sparsity constraint kxk0 ≤ B on x is a nonlinear constraint that 5

makes this type of approximation very challenging. In fact, a solution to the arbitrary dictionary scenario presented here implies a solution to Exact Cover by 3-Sets (X3C) [23], and as noted in [43], the same reduction shows that even approximating the error of the optimal B-term representation up to any constant factor is NPhard. We cope with the hardness by restricting the problem to more structured dictionaries. Specifically, we will focus on finding approximation algorithms1 for this representation problem mainly when the dictionary is a wavelet basis. Nonlinear approximation. We study the problem of nonlinear approximation using wavelets from an optimization perspective, which is becoming more salient in light of the rising number of domains for which wavelet approximations are proving valuable. Given an arbitrary target vector f ∈ Rn , a sparsity constraint B, and a wavelet basis {ψk }nk=1 , the goal is to compute fˆ as a linear combination of at most

B basis vectors that minimizes f − fˆ p . Note that the representation fˆ we seek belongs to the nonlinear space FB associated with the given wavelet basis. Let us pause for a moment to give an idea of wavelet bases. A wavelet basis {ψk }nk=1 for Rn is a basis where each vector is constructed by dilating and translating a single function referred to as the mother wavelet ψ. For example the Haar mother wavelet, due to Haar [53], is given by:    1 if 0 ≤ t < 1/2   ψH (t) = −1 if 1/2 ≤ t < 1     0 otherwise The Haar basis for Rn is composed of the vectors ψj,s [i] = 2−j/2 ψH



i−2j s 2j



where

i ∈ [n], j = 1, . . . , log n, and s = 0, . . . , n/2j − 1, plus their orthonormal complement √1 1n . n

This last basis vector is closely related to the Haar multiresolution scaling

1

The overloading of the word “approximation” is unfortunate, but should be clear from context: A nonlinear approximation is a vector that belongs to a nonlinear space as described above, and we will often refer to it as a B-term representation. An approximation algorithm is, as is well known, a Turing Machine that produces a feasible solution to an optimization problem whose value, or error, is a guaranteed factor away from that of an optimum solution to the problem.

6

function φH (t) = 1 if 0 ≤ t < 1 and 0 otherwise. In fact, there is an explicit recipe for constructing the mother wavelet function ψ from φ [63,70] (cf. Section 2.3; see also Daubechies [64], and Mallat [21]). Notice that the Haar mother wavelet is compactly supported on the interval [0, 1). This wavelet, which was discovered in 1910, was the only known wavelet of compact support until Daubechies constructed a family of compactly-supported wavelet bases [20] in 1988 (see also [21, Chp. 6]). Why wavelets? Wavelets are fundamental in the field of nonlinear approximation theory. Nonlinear approximation has a rich history starting from the work of Schmidt [79] in 1907, and has been studied under various contexts. More recently, in a substantial review, DeVore [25] explores the advantages of nonlinear approximation over the simpler linear one. The advantages are investigated in terms of the rate of decay of the approximation error relative to the number of terms in the approximate representation. Indeed, the question usually considered in approximation theory is: Given a basis and a parameter α > 0, what is the class of functions whose B-term approximation error EB,p decays like O(B −α )? The success of wavelet bases in this context was first displayed by DeVore, Jawerth, and Popov [26]. They show that if f is sufficiently “smooth” (belongs to a Besov space), then its B-term approximation error using certain wavelet bases decays quickly. Indeed, they show that the (properly normalized) wavelet coefficients of functions belonging to these function spaces decay rapidly; hence, retaining the largest B of the coefficients suffices to give the result. This leads to wavelet-based algorithms for noise reduction (called wavelet shrinkage) developed by Donoho and Johnstone [28] (see also Donoho et al. [29]). The idea here is that “large” (i.e., larger than a threshold) wavelet coefficients mostly carry signal information, while “small” wavelet coefficients are thought to be mostly noise and can be set to zero. Under a Gaussian noise model, they find thresholds that minimize the expected `2 error. Wavelets have since found extensive use in image representation and signal compression (see, e.g., Mallat [64], and Stollnitz, Derose, and Salesin [80]). The basic 7

paradigm for using wavelets in image compression is shown in Figure 1.1, and it goes as follows [25]. The problem is viewed as a nonlinear approximation with the image as the target function. The wavelet expansion coefficients of the image are computed and the smallest of these are pruned using a threshold. Quantization of the remaining coefficients then takes place. These remaining quantized coefficients are the representation of the approximate image (which can be reconstructed using the inverse wavelet transform). Finally, the representative coefficients are compressed using a loss-less encoder. A more direct encoding approach, however, would optimize the number of bits stored directly. Cohen et al. [16] show decay results for this bit-budget version of the problem that are similar to those developed be DeVore et al. [26].

Figure 1.1: Image compression using wavelets. The dashed box highlights the nonlinear approximation that takes place.

Why are the largest wavelet coefficients retained? Retaining the largest coefficients is the optimal strategy for minimizing the `2 norm in a nonlinear approximation. In the nonlinear compression procedure above, we are computing a representative image fˆ that approximates the original image f and minimizes the `2 error

EB,2 = minfˆ∈FB f − fˆ 2 . As in the case for linear approximation, this problem is well-understood under this error measure. The optimal representation fˆ ∈ FB simply retains the B wavelet coefficients hf, ψk i that are largest in absolute value; P i.e., if |hf, ψk1 i| ≥ |hf, ψk2 i| ≥ · · · ≥ |hf, ψkn i|, then fˆ = B i=1 hf, ψki iψki . In fact, 8

this is true for any orthonormal basis when the error is measured under the `2 norm (cf. Chapter 2). Recently, wavelets have been utilized in databases to facilitate selectivity estimation in query optimization [67], for approximate query processing [12, 87, 88], and for content-based image querying [57, 73]. For example, Chakrabarti et al. [12] show how to create synopses of relational tables using wavelet coefficients, then perform database queries, that are fast but approximate, over these compact sets of retained coefficients. Many researchers observe, however, that the choice of least-squares error is not natural for the representation of data or distributions. Matias, Vitter, and Wang [67], for example, suggest using the `1 and the `∞ error measures. The `1 norm is a robust measure that is less sensitive to outliers than the `2 norm, and it is wellsuited for measuring distributions. Even in image compression, Mallat [64, p. 528] and Daubechies [21, p. 286] point out that while the `2 measure does not adequately quantify perceptual errors, it is used, nonetheless, since other norms are difficult to optimize. We develop techniques for general `p norms and provide approximation algorithms for finding nonlinear approximate representations that minimize these error measures (including `∞ ) under a large class of wavelet bases. We present a greedy algorithm with performance guarantees when the given wavelet is compactly supported, and we describe a dynamic programming algorithm more suited for use in the Haar wavelet case. Given an arbitrary target vector f ∈ Rn , a sparsity constraint B, a p-norm, and, for example, any Daubechies wavelet basis for Rn , our

greedy algorithm constructs a function fˆ ∈ FB whose representation error f − fˆ p is no more than O(log n)EB,p . The algorithm runs in linear time, performs one pass over the given function, and requires logarithmic space to compute this solution fˆ. Hence, our algorithm lends itself well to large data streams where input arrives rapidly as a time-series signal. The crux of the argument is a nonlinear program that we use to obtain a lower bound on the optimal representation error. Even though it 9

is straightforward to solve, this program can be a powerful tool for tackling related problems in nonlinear approximation; for example, we show how it can immediately be applied to the bit-budget version of the problem mentioned earlier in the context of image compression. In the specific case of the Haar wavelet, we present a dynamic programming algorithm that constructs a solution fˆ whose representation error is arbitrarily close to the optimal error EB,p . This algorithm also performs one pass over the target function, and requires sublinear space (i.e., it is streaming) when c

p > 1. Both our algorithms extend to multi-dimensional signals (belonging to Rn , for fixed c ∈ N), and to certain dictionaries composed of multiple wavelet bases. When the dictionary is not composed of a single wavelet basis, the representation problem becomes a form of highly nonlinear approximation (in fact, the arbitrary dictionary case, which as we said is NP-hard even to approximate [23, 43] is considered a form highly nonlinear approximation). A notable example of such redundant dictionaries are those of Coiffman and Wickerhauser [17], which are binary treestructured dictionaries composed of O(n log n) vectors and an exponential number of bases. Given such a tree-structured dictionary of wavelet bases, we show how our algorithms can be used (and combined) to find an approximately-best basis for representing the target f ∈ Rn using B-terms under any given `p error measure. We believe that our two approximation algorithms, and their extensions that we detail in Chapters 2 and 3, may be used effectively in those new domains where wavelets are being utilized for summarizing large data.

Linear approximation. In the linear setting, we address the question of finding a coreset for the given linear approximation problem. A coreset is a subset of the input such that a solution to the problem restricted to this subset results in a near optimal solution to the original problem. In order to elucidate this, let us start by describing how the problem may be viewed. Recall that in the linear approximation case, the B dictionary elements used to 10

represent the target function f ∈ Rn are specified as part of the input. The problem is then equivalent to finding an approximate representation of the target vector f via a linear combination of the columns of an n × B matrix A. More formally, let the linear space F[B] ⊆ Rn be, as before, the span of the given dictionary elements {a1 , a2 , · · · , aB }, and arrange these elements as columns of a matrix A. Then, finding

fˆ ∈ F[B] that minimizes f − fˆ p is, as we described earlier, the familiar regression problem minx∈RB kAx − f kp . In the `2 case, the optimal solution is the projection of f onto span(A); i.e., fˆ = AA+ f where A+ is the Moore-Penrose generalized inverse of A (see, e.g., Golub and van Loan [45, pp. 257–8]). The problem can be formulated as a linear program in the `1 case, and similarly under any other `p norm, it can be viewed as a convex optimization problem with linear constraints: minimize

Pn

p i=1 ti

subject to ATi? x − fi ≤ ti ,

i = 1, . . . , n

fi − ATi? x ≤ ti ,

i = 1, . . . , n

where Ai? is the i-th row of A. However, (in both linear and nonlinear approximation,) we usually think of the sparsity constraint B as being much smaller than the dimension n of the target vector. In such an over-constrained linear approximation or regression problem, it would seem that many of the constraints are superfluous. Indeed, we investigate this hypothesis by addressing the following question: Does there exist a small subset of the constraints whose solution constitutes a good solution to the whole system of constraints? This question is an instance of an important paradigm in the design of algorithms operating on large data; that of finding a small subset of the input, often known as a coreset, such that computations performed only on this subset are sufficient to accurately approximate the optimum of the original problem (see, e.g., [2, 7, 36, 54]). Specifically in our case, if xˆ ∈ RB is the solution to an `p regression problem involving 11

only a small subset of the original constraints, then in order to call this subset a “coreset” we require that kAˆ x − f kp is no more than a factor (1 + ) away from minx kAx − f kp for a given 0 <  < 1. Notice that a coreset provides a special embedding of f and the columns of A into a lower dimensional space. Suppose that f˜ and A˜ represent the constraints i1 , . . . , is comprising the coreset, then f˜[j] = f [ij ] and A˜jk = Aij k for all k = 1, . . . , B and j = 1, . . . , s. Drineas, Mahoney, and Muthukrishnan [34] answer the above question positively under the `2 error measure. Assuming the rank of A is d, they construct coresets of size O(d2 /2 ) (and, more recently, O(d log d/2 ) [33]). While their algorithm uses O(nd2 ) time to compute the coreset—which is sufficient time to solve the regression problem exactly—in a subsequent work, Sarl´os [76] replaces the coreset embedding with the Fast Johnson–Lindenstrauss embedding of Ailon and Chazelle [4] to obtain ˜ a (1 + )-approximate solution in O(nd) time. Further, in course of providing a sampling-based approximation algorithm for `1 regression, Clarkson [15] shows that coresets exist and can be computed efficiently for a controlled `1 regression problem. Clarkson first preprocesses the input matrix A to make it well-conditioned with respect to the `1 norm then applies a subgradient-based approximation algorithm to guarantee that the `1 norm of the target vector is conveniently bounded. Coresets ˜ 3.5 /2 ) are thereupon exhibited for this modified regression problem. of size O(d We settle the existence of coresets for the (weighted) `p over-constrained regression problem for all p ∈ [1, ∞), and we present a sampling-based randomized algop

rithm that computes them. The coresets we construct are of size O(dmax{ 2 +1, p}+1 /2 ) (when p 6= 2; if p = 2, it is O(d2 /2 )), where d is, as before, the rank of the given matrix A describing the dictionary. Our algorithm easily extends to situations where we wish to represent multiple target vectors at once, and it is transparent to convex constraints placed on the solution x. Such constraints arise, for example, to rule out unacceptable (e.g., negative) solutions, or to reflect prior knowledge about the solution. 12

In course of proving this result, we develop a sampling strategy that preserves the subspace spanned by the columns of the matrix A. Our subspace-preserving sampling generalizes the sampling techniques of [15] and [34] for the `1 and `2 norms, respectively, to `p norms. We also define a notion of a p-well-conditioned basis which is a basis that behaves in a manner analogous to that of a unitary matrix under a general `p norm. We construct these bases via a geometrical argument similar to the aforementioned conditioning step used by Clarkson [15] for `1 regression. Our construction shows how these bases are a generalization of the QR decomposition from linear algebra. Both the notions of subspace-preserving sampling and p-wellconditioned bases that we define may be of independent interest. In practice, our coreset algorithm could reduce the time needed by a regression solver to compute solutions to large systems since it summarizes the input, paring down the problem to be solved. Our work in Chapter 4 unifies, improves upon, and generalizes the works of Clarkson [15] and Drineas et al. [34] to all p-norms, p ∈ [1, ∞), providing techniques that have potential applications in extending known results to `p norms when p 6= 2. Next we give a roadmap of the dissertation and provide more details about the results appearing in each of chapters 2, 3, and 4. The chapters include additional background material that is relevant to the problem each of them addresses.

1.1

Roadmap and Summary of Results

The basic structure of this dissertation is summarized in Figure 1.2. As explained above we will be addressing two types of regression for approximating a given vector f ∈ Rn : Nonlinear and linear regression. The nonlinear approximation problem on which we mainly concentrate is the following (Chapter 2 and Chapter 3, [51, 52]):

13

!p Regrssion

Chapter 4 Linear (overconstrained) Coresets for !p regression

Chapters 2 & 3 Nonlinear (using wavelets) B -term representation Adaptive quantization Best-basis selection

Figure 1.2: Organization of and problems addressed in dissertation.

Problem 1.1 (B-term Representation). Given f ∈ Rn , a compactly-supported wavelet basis for Rn Ψ = [ψ1 , ψ2 , · · · , ψn ], and an integer B, find a solution vector x ∈ Rn with at most B non-zero components xi such that kΨx − f kp is minimized. We will often refer to this problem as the unrestricted B-term representation problem in order to contrast it with a restricted version where the non-zero components of x can only take on values from the set {hf, ψi i, i ∈ [n]}. That is, in the restricted version, each xi can only be set to a coefficient from the wavelet expansion of f using Ψ, or zero. • We show that the solution for the restricted version of the problem is a O(log n) approximation to the optimal solution for the unrestricted version for all pnorms, p ∈ [1, ∞]. That is, if xˆ is the optimal solution to the restricted version, then kΨˆ x − f kp ≤ O(log n) minkxk0 ≤B kΨx − f kp . (The big-Oh hides a small constant factor.) We show this by giving a dual system whose optimal value is a lower bound on the optimal error for the unrestricted problem. The dual system is easy to solve and its optimal solution is comprised of only B wavelet coefficients, thus its solution is also a solution to the restricted problem (Section 2.4). • For the Haar wavelet, we use the dual system of constraints mentioned above along with another lower bound that uses vectors associated with the Haar 14

scaling function φH to tightly bound the B values that the unrestricted optimal solution chooses. We utilize these bounds in an intricate dynamic program that produces a solution whose error is arbitrarily close to the optimal error. That is, given 0 <  < 1, our dynamic program produces xˆ such that, kΨˆ x − f kp ≤ (1 + ) minkxk0 ≤B kΨx − f kp (Section 3.3). We implemented our algorithm, and we tested heuristics that reduce its time and space complexities without significantly affecting the quality of the computed solution (Sections 3.4.4 and 3.6). Further, the same algorithm yields a (1 + ) approximation for other compactly-supported wavelets; however, in these other cases, it requires nO((log n)/p) time (Section 3.4.2). This shows that Problem 1.1 is not Max-SNP-Hard (unless NP is in pseudo-polynomial time). • Both of these algorithms are one-pass streaming algorithms, meaning they use space that is sublinear in the size of their input, they access the input sequentially, and once they discard a data element, they never look at it again. The restricted algorithm, which is greedy, runs in O(n) time and O(log n + B) ˜ 1+ p2 B/2 ) time and space (Theorem 2.10). The dynamic program runs in O(n ˜ p1 B 2 /) space (Theorem 3.8)—which implies that it does not stream under O(n the `1 norm. Both algorithms also extend to multiple dimensions when the dimension is a given constant (Section 3.4.1), and to weighted p-norms when the weights are non-zero (Section 3.4.3). As indicated in Figure 1.2, the algorithms we develop can be extended to two variants of our main problem which we will refer to as the adaptive quantization problem and the best-basis selection problem. The B-term representation problem implicitly requires storing 2B numbers: the B values of the solution components that we compute, and the B indices of these components. As alluded to earlier when we described image compression using wavelets, this representation does not directly translate into a compression algorithm. A more practical approach and a natural generalization of our problem is to 15

minimize the error of the representation subject to a given bit-budget. This problem is known as adaptive quantization, and we show how our greedy algorithm extends to variants of this problem (Section 2.5). The best-basis selection problem that we address is a bi-criteria optimization problem where instead of the wavelet basis Ψ, we are given a full binary treestructured dictionary D composed of compactly-supported wavelet bases, and we wish to find the best B-term representation of f in a basis from D. Notice that we are looking for the best basis in D for representing f with B terms and the B-term representation of f using this basis. Our dictionary construction is similar to that of Coiffman and Wickerhauser [17]; it contains O(n log n) vectors that form exponentially many bases, one for each cut in the tree. Given f ∈ Rn and such a dictionary D, we show how to find a basis from D and a B-term representation of f in this basis that provably approximates the best B-term representation of f in a basis from D (Section 3.5). Going back to Figure 1.2, the problem in linear approximation that we address is finding coresets for `p regression (Chapter 4, [19]) Definition 1.1 (Regression Coreset). Let 0 <  < 1. A coreset for a regression prob ˆ lem minx∈Rm kAx − bkp is a set of indices I such that the solution xˆ to minx∈Rm Ax−

ˆb , where Aˆ is composed of those rows of A whose indices are in I and ˆb consists p of the corresponding elements of b, satisfies kAˆ x − bkp ≤ (1 + ) minx kAx − bkp . An alternative way to view Aˆ is the following. Let I = {i1 , i2 , · · · , ir }, and let S be an r × n matrix with Skik = wk , where wk is some weight that may depend on the index k ∈ [r]. All other entries in S are set to zero. Then the vector xˆ that we seek is the solution to minx kSAx − Sbkp . We can now easily state the problem we address. Problem 1.2 (Coresets for `p Regression). Let p be a fixed number in [1, ∞). Given an n × m matrix A of rank d  n and a vector b ∈ Rn , find a coreset for minx kAx − bkp whose size is independent of the high dimension n. 16

Convex programming methods can be used to solve the over-constrained regression problem minx∈Rm kAx − bkp where A is n × m in time O((mn)c ), for c > 1. Given such a problem, we wish to find a coreset and run the desired solver on the resulting problem in time O(mc n), c > 1. This is the reason we wish to find a small coreset; i.e., its size does not depend on n. Recall that in our earlier discussion we described A as a dictionary of B elements of Rn , where B is much smaller than n. In this setting, we immediately have that d , rank(A)  n since d ≤ B. Now suppose we are using the `p norm, p ∈ [1, ∞), to measure the error of our approximate representation, and let k = max{p/2 + 1, p} for p 6= 2. Also, let φ(s, t) be the time required to solve an `p regression problem with s constraints and t variables. • We show that coresets for over-constrained `p regression exist, and we present an efficient randomized algorithm that finds a coreset for the given regression problem (Algorithm 4.1). The algorithm runs in two stages. In the first stage, we compute a set of non-uniform probabilities in O(nd5 log n) time with which we sample rˆ1 = O(36p dk+1 ) rows of A and the corresponding elements of the target vector b, and we solve an `p regression problem on the selected sample. The solution xˆ1 to this problem gives an 8-approximation to the original problem, and it can be found in O (nd5 log n + φ(ˆ r1 , m)) time. We use the residual vector Aˆ x1 − b to construct a new set of probabilities with which we sample rˆ2 = O(36p dk+1 /2 ) rows of A and the corresponding elements of b, and again, we solve an `p regression problem on the selected sample. The new solution xˆ2 to this second induced problem gives a (1 + )-approximation to the original problem. The total running time of finding this near-optimal solution is O (nd5 log n + φ(ˆ r2 , m)) (Theorem 4.4). Our algorithm can be easily extended to more general settings; for example, when the solution is constrained to a convex set (Section 4.6). 17

• Our sampling probabilities are based on the row-norms of a p-well-conditioned basis for the constraint matrix. Essentially, a matrix U is a p-well-conditioned basis for an n × m matrix A of rank d if it is a basis for the space spanned by the columns of A, and if for any x ∈ Rd , kU xkp approximates kxkp/(p−1) by a multiplicative factor that depends only on d. This behavior emulates that of a basis in the 2-norm, where kU xk2 = kxk2/(2−1) = kxk2 . These bases can be thought of as computational analogs of Auerbach and Lewis bases studied in functional analysis [89], and we can construct them in O(nd5 log n) time (Section 4.4.1). We show that sampling the rows of A using probabilities that are based on information in the rows of a p-well-conditioned basis for A produces a submatrix Aˆ of A whose rank equals that of A. In other words, our sampling technique is subspace-preserving as it maintains the structure of the space spanned by the columns of the dictionary matrix A (Section 4.4.2). As indicated throughout, problems 1.1 and 1.2, and our related results are detailed in chapters 2, 3, and 4 respectively. Finally, Chapter 5 lists some future research directions one of which involves using our sampling technique as a feature selection method in kernel-based classification, and another proposes using our Bterm wavelet representation algorithm as a filter to sparsely represent images before attempting to classify them.

18

Chapter 2 Greedy Algorithms for Wavelet Sparse `p Regression 2.1 2.1.1

Background The Sparse Wavelet Representation Problem

Consider one of the most basic problems in nonlinear approximation. Given a wavelet basis {ψi }ni=1 for Rn and a target vector (or signal) f ∈ Rn , construct a representation fˆ as a linear combination of at most B basis vectors so as to minimize some normed distance between f and fˆ. The B-term representation fˆ belongs to the nonlinear P space FB = { ni=1 zi ψi : zi ∈ R, kzk0 ≤ B}, where kzk0 is the number of nonzero coefficients of the vector z. The problem is well-understood if the error of the representation is measured using the Euclidean or `2 norm. Since the `2 norm is preserved under rotations, by Parseval’s theorem, we have 2 X X X

f − fˆ 2 = fi − zj ψj [i] = (hf, ψi i − zi )2 . 2 i

j

i

It is clear then that the solution under this error measure is to retain the largest B inner products hf, ψi i, which are also the coefficients of the wavelet expansion of f . It is noteworthy that the fact that we have to store wavelet coefficients is a 19

consequence of the proof of optimality. Unlike the case of `2 error, however, problems involving `p norms are not well-understood in this setting. This chapter takes a few steps toward filling this gap. The common strategy for the B-term representation problem in the literature has been “to retain the [B] terms in the wavelet expansion of the target function which are largest relative to the norm in which error of approximation is to be measured” [25, p. 4]. This strategy is reasonable in an extremal setting; i.e., if we are measuring the rate of the error as a function of B. An example of such a result is [64, p. 397]: If f is bounded variation, then the nonlinear approximation of f using R1 its largest wavelet coefficients decays like (kf kV /B)2 . Here kf kV = 0 |f 0 (t)|dt is the total variation norm of f . But it is easy to show that the common strategy is sub-optimal, and the only bound on its performance with respect to the target function’s best possible representation using B terms from the wavelet basis is given by Temlyakov [84] (see also [85, Sec. 7]). Temlyakov shows that given f in the R (infinite dimensional) Banach function space Lp [0, 1]; i.e., kf kLp = ( |f |p )1/p < ∞, with 1 < p < ∞, if the given basis Ψ := {ψi }i∈Z is Lp -equivalent to the Haar basis [27], then the error of the common greedy strategy is an α factor away from that of the optimal B-term representation. The paper determines that the factor α depends on p and Ψ, and that the algorithm terminates in a finite number of steps. However, exactly how this factor depends on p and Ψ was not established. From an optimization point of view in the finite-dimensional setting, more specifically, the relationship between the factor α and the dimension n of the space spanned by the basis similarly remained unexplored. Further, this setting opens up the question of what the computational complexity (in terms of running time and space) of the algorithm is as a function of n and the constraint B on the number of terms. We note that, in this finite-dimensional setting which concerns us, several researchers [39, 40, 46,65,66,71] have observed that the common strategy of retaining the largest wavelet coefficients relative to the error norm is sub-optimal. Subsequently, Garofalakis and 20

Kumar [38,40], Muthukrishnan [71], Matias and Urieli [65], and Guha [46] considered a restricted version of the problem under the Haar basis where we may only choose wavelet coefficients of the data (it should be easy to see that when p 6= 2, the best B wavelet coefficients are not necessarily the largest ones). However, the following simple example shows that retaining any B or less wavelet coefficients is sub-optimal by at least a factor 1.2 compared to the optimal solution. Let f = h1, 4, 5, 6i, whose Haar wavelet transform is h4, −1.5, −1.5, −0.5i. The best solution that stores one real value under the `∞ norm is h3.5, 0, 0, 0i, whereas the best solution that is restricted to retaining a wavelet coefficient is h4, 0, 0, 0i. This example can be generalized to any B by repeating the signal with alternating signs; i.e., setting f = h1, 4, 5, 6, −1, −4, −5, −6, 1, 4, 5, 6, · · · i. Hence, the question of the approximation guarantee of these coefficient-retaining strategies with respect to the unrestricted optimum has (until now) remained unclear. We provide an extensive answer to this optimization question for all compactly supported wavelet bases (of Rn ) when the error is measured under any `p norm, 1 ≤ p ≤ ∞. As we will shortly see, we prove that given an arbitrary f ∈ Rn and a compactly-supported wavelet basis {ψi }ni=1 , retaining the coefficients hf, ψi i that are largest in a suitably-scaled order with respect to the specified p-norm gives a O(log n)-approximate solution. That is, the computed solution is at most O(log n) times the optimal solution for the unrestricted optimization problem (which can choose any B real numbers). Our approximation algorithm is a one-pass streaming algorithm that runs in O(n) time and uses O(log n + B) space making it suitable for large data.

2.1.2

Adaptive Quantization

The B-term representation problem implicitly requires storing 2B numbers: the B values of the solution components that we compute, and the B indices of their corresponding basis vectors. The actual cost (in bits) of storing the real numbers zi , 21

however, is not uniform. Depending on the scenario, it may be beneficial to represent a function using a large number of low-support vectors with low precision zi ’s or a few vectors with more detailed precision zi ’s. Hence, a B-term representation algorithm does not translate directly into a practical compression algorithm. A natural generalization, and a more practical model as discussed in [16], is to minimize the error subject to the constraint that the stored values and indices cannot exceed a given bit-budget. Note that, again, we are not constrained here to storing wavelet expansion coefficients. This bit-budget version of the problem is known as adaptive quantization, which we will also consider. To the best of our knowledge, there are no known approximation algorithms for this problem.

Organization. We first summarize our results, then start our discussion by reviewing some preliminaries of wavelets in Section 2.3. In Section 2.4 we present our greedy approximation algorithm which also relates the restricted and the unrestricted versions of the problem. Section 2.5 outlines extensions of our result to the adaptive quantization problem. Finally, in Section 2.6 we give examples that demonstrate the use of our greedy algorithm for compressing images.

2.2

Contributions

We include the definition of the B-term representation problem using wavelets below for easy reference.

Problem 2.1 (B-term Representation). Given f ∈ R^n, a compactly-supported wavelet basis {ψi} of R^n, an integer B, and p ∈ [1, ∞], find a solution {zi} with at most B non-zero components zi ∈ R such that ‖f − Σi zi ψi‖_p is minimized.

Recall that we refer to this problem as the unrestricted B-term representation problem in order to contrast it with the restricted version that limits the non-zero components of its solution {zi} to wavelet expansion coefficients of f. That is, in the restricted version, each zi must belong to {⟨f, ψi⟩ : i ∈ [n]}. As we argued in the previous section, even for this basic form of the problem, we can identify the following issues:

• There are no analysis techniques for `p norms. In fact, this is the bottleneck in analyzing any generalization of the B-term representation problem; e.g., the adaptive quantization problem.

• All of the (limited) analyses in this optimization setting have been done on the Haar system, which, although important, is not the wavelet of choice in some applications. Further, in this setting, the bounds on the performance of the algorithms used in practice which retain wavelet coefficients are unclear.

• Signals that require transform coding are often presented as a streaming input; however, no algorithms are known except for (weighted) `2 norms.

• The computational complexity of transform coding problems using wavelet bases is unresolved.

We ameliorate the above by means of the following:

1. We show that the restricted solution to the B-term representation problem, which retains at most B wavelet coefficients, is a O(log n) approximation to the unrestricted solution under compactly-supported wavelet systems (e.g., Haar, Daubechies, Symmlets, Coiflets, among others) and all `p norms (1 ≤ p ≤ ∞). (This statement differs from the statement in the extremal setting that says that discarding all coefficients smaller than some threshold τ introduces O(τ log n) error, since the latter does not account for the retained number of terms.) We provide a O(log n + B) space and O(n) time one-pass algorithm in the data stream model (cf. Section 2.3). We give a greedy strategy that is similar to some scaling strategies used in practice, thus demonstrating why several such scaling-based algorithms used in practice work well.

2. We introduce a new lower-bounding technique using the wavelet basis vectors {ψi}, which gives us the above result regarding the gap between the restricted and unrestricted versions of the problem.

3. Our result applies to multi-dimensional signals in a straightforward manner (and in fact, we implement a two-dimensional non-streaming version of the algorithm).

4. We show that our lower-bounding technique for the B-term representation problem can be extended to give an approximation algorithm for adaptive quantization. This is the first approximation algorithm for this problem.

2.3 Preliminaries

2.3.1 Data Streams

For the purpose of this chapter, a data stream computation is a space-bounded algorithm, where the space is sublinear in the input. Input items are accessed sequentially, and any item not explicitly stored cannot be accessed again in the same pass. We focus on one-pass data streams. We will assume that we are given numbers f = f(1), …, f(i), …, f(n), which correspond to the signal f to be summarized, in increasing order of i. This model is often referred to as the aggregated model and has been used widely [42, 47, 50]. It is especially suited to modeling streams of time-series data [11, 59] and is natural for transcoding a single channel. Since we focus on dyadic wavelets (those dilated by powers of 2; cf. Eq. (2.1)), assuming n is a power of 2 will be convenient (but not necessary). As is standard in the literature on streaming [41, 49, 56], we also assume that the numbers are polynomially bounded; i.e., all |f(i)|'s are in the range [n^{-c}, n^c] for some constant c.

2.3.2 Compactly Supported Wavelets

In this section we give a brief introduction to wavelets, including some definitions and results which we will refer to in the rest of this and the next chapter. Readers familiar with wavelets can easily skip this section. For thorough expositions on wavelets, we refer the interested reader to the authoritative texts by Daubechies [21] and Mallat [64]. The material presented here is essentially a (highly) compacted version of [64, Chp. 7]. Since we will be working with compactly-supported wavelets, we start with the definitions of a compactly supported function and a wavelet.

Definition 2.1 (Compact Support). A function f : R → R has compact support if there is a closed interval I = [a, b] such that f(x) = 0 for all x ∉ I.

Definition 2.2 (Wavelet). A wavelet ψ is a function of zero average whose set of dilates and translates,

    ψj,s(t) = (1/√(2^j)) ψ(t/2^j − s),   j, s ∈ Z,    (2.1)

forms an orthonormal basis for L2(R).

In the above definition, L2(R) is the space of bounded-energy functions; that is, ∫ |f(t)|^2 dt < +∞. The fact that an orthonormal basis can be generated from one function in this manner is remarkable. The underlying mathematical structure can be explained using the notion of multiresolution signal approximations [63, 70]. Intuitively, a multiresolution approximation involves computing approximations of a signal at various scales using orthogonal projections.

Definition 2.3 (Multiresolution Approximation). [64, Def. 7.1] A multiresolution approximation for L2(R) consists of a sequence of closed subspaces {Vj}j∈Z with the following properties:

1. Vj+1 ⊂ Vj;
2. lim_{j→+∞} Vj = ∩_{j=−∞}^{+∞} Vj = {0};
3. lim_{j→−∞} Vj = Closure(∪_{j=−∞}^{+∞} Vj) = L2(R);
4. f(t) ∈ Vj ⟺ f(t − 2^j k) ∈ Vj for all k ∈ Z;
5. f(t) ∈ Vj ⟺ f(t/2) ∈ Vj+1; and,
6. There exists a function φ s.t. {φ(t − s)}s∈Z is an orthonormal basis for V0.

Approximating a function g at a resolution 2^{−j} consists of locally averaging g at discrete intervals of size 2^j. Each space Vj in a multiresolution approximation contains all the approximations of functions at the resolution 2^{−j}. These approximations are defined via the orthogonal projections of functions onto Vj. The spaces {Vj}j∈Z are embedded, meaning that the information needed to approximate a function at a resolution 2^{−j−1} is available in its approximation at the higher resolution 2^{−j}. Essentially, f(t/2) enlarges f(t), implying that if f(t) is an approximation to some function at the resolution 2^{−j}, then f(t/2) is an approximation (to, in general, a different function) at the lower resolution 2^{−j−1}; and vice versa. Properties (2) and (3) respectively say that at the lowest resolution we lose all the details of g, and at the highest resolution we completely recover g.

The scaling function. The function φ is called the scaling function of the multiresolution approximation. Indeed, a multiresolution approximation is completely characterized by this scaling function, which can be used to generate an orthonormal basis for each Vj through

    φj,s(t) = (1/√(2^j)) φ(t/2^j − s),    (2.2)

since, for all j, {φj,s}s∈Z is an orthonormal basis for Vj [64, Thm. 7.1]. An example of a multiresolution approximation is piece-wise constant approximations. Here φ is the function that is 1 in the interval [0, 1) and zero elsewhere. The space Vj is the set of functions g that are constant on intervals of size 2^j. If a function is constant on intervals of size 2^{j+1}, then it is also constant on intervals of size 2^j; i.e., Vj+1 ⊂ Vj. This multiresolution system generates the Haar wavelet (more on this later). Another example is Shannon approximations [64, Ex. 7.2]. Here the space Vj is the set of functions whose Fourier transform is supported in the interval [−2^{−j}π, 2^{−j}π]; and an orthonormal basis for V0 is generated by the scaling function φ = sin(πt)/(πt) [64, Prop. 3.2].

The detail spaces. Where do wavelets come in? As we vary s ∈ Z in the wavelet sequence (2.1), the resultant sequence {ψj,s}s∈Z turns out to be an orthonormal basis for the orthogonal complement of Vj in Vj−1. Intuitively, since Vj ⊂ Vj−1, this complementary space, call it Wj, provides the details of a function g that are present at the resolution 2^{−j+1} but disappear at the lower resolution 2^{−j}. Indeed, we can write Vj−1 = Wj ⊕ Vj. Using this expression and properties (2) and (3) of Definition 2.3, we get that L2(R) = ⊕_{j=−∞}^{+∞} Wj. Finally, observe that the detail spaces {Wj}j∈Z are all orthogonal; hence, the sequence (2.1) is an orthonormal basis for L2(R). But how do we obtain the wavelet function ψ that generates this sequence? This was done in the wonderful works of Mallat [63] and Meyer [70] by relating the scaling function φ to the wavelet function ψ through conjugate mirror filters in signal processing.
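As a concrete finite-dimensional illustration of the splitting Vj−1 = Wj ⊕ Vj (a worked example added here for intuition; it is not part of the original development), consider the piece-wise constant (Haar) system on two points. Any vector (a, b) ∈ V0 = R^2 decomposes into its average part in V1 and its detail part in W1:

    (a, b) = ((a + b)/2) (1, 1) + ((a − b)/2) (1, −1),

where (1, 1) spans the coarse space V1 and (1, −1) spans the detail space W1. Dropping the detail term is exactly the orthogonal projection onto V1, i.e., replacing both entries by their average; the detail coefficient (a − b)/2 is what the Haar wavelet records.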

Conjugate mirror filters. Recall that {φ(t − s)}s∈Z is an orthonormal basis for V0. Hence, the scaling function φ(t) ∈ V0. Property (5) then implies that (1/√2)φ(t/2) ∈ V1 ⊂ V0. It follows that we can represent (1/√2)φ(t/2) as a linear combination of the basis vectors {φ(t − s)}s∈Z; that is,

    (1/√2) φ(t/2) = Σ_k h[k] φ(t − k),

where h[k] = ⟨(1/√2)φ(t/2), φ(t − k)⟩. The above equation relates a linear time-invariant operation (dilation by 2) to its integer translations. Hence, the sequence h[k] can be interpreted as a discrete filter. In fact, Theorem 7.2 of [64] shows that if φ is a scaling function, then Σ_k h[k] = √2, and the Fourier series of h[k] satisfies |ĥ(ω)|^2 + |ĥ(ω + π)|^2 = 2 for all ω ∈ R. Discrete filters that satisfy this last property are called conjugate mirror filters. The theorem also gives sufficient conditions, or admissibility conditions, for a conjugate mirror filter to generate a scaling function for a multiresolution approximation.

Given a scaling function φ and its corresponding filter h, Theorem 7.3 of [64] then makes constructing the wavelet function ψ immediate:

    (1/√2) ψ(t/2) = Σ_k g[k] φ(t − k),

where g[k] = (−1)^k h[1 − k] is also a conjugate mirror filter. Further, Σ_k g[k] = 0.

Compactly-supported wavelets. The above shows that if we synthesize a conjugate mirror filter h that satisfies the admissibility conditions, then we can construct a scaling function φ, a filter g, and the associated wavelet ψ. Recall that ψ generates orthonormal bases for the detail spaces {Wj}j∈Z, and that {ψj,s}j,s∈Z is an orthonormal basis for L2(R), where ψj,s is defined in (2.1). Not all wavelet orthonormal bases correspond to multiresolution approximations, however. Lemarié [61] showed that if ψ does not correspond to a multiresolution approximation, then ψ does not have compact support. We will only be concerned with compactly-supported wavelets in the remainder of this and in the next chapter.

It is shown in [21, 64] that the scaling function φ has compact support if and only if h has compact support, and that their support regions are equal. The support of ψ has the same size, but it is shifted. In order to minimize the size of the support of ψ, the filter h must have as few non-zero coefficients as possible. The size of the support of ψ determines the number of wavelet coefficients {⟨f, ψj,s⟩}s∈Z at each level j of the multiresolution approximation that are affected by the value of the represented function f at a point t0. However, a small support size implies that ψ will have a small number of vanishing moments. If the function f we wish to represent is 'regular', then a wavelet ψ with many vanishing moments produces a large number of wavelet coefficients ⟨f, ψj,s⟩ that are small in amplitude [64, Sec. 6.1]. Hence, there is a tradeoff between the number of vanishing moments and the support size that impacts the wavelet coefficients, and deciding which wavelet to use for the given task is somewhat of an art (see, e.g., [14]). Since the wavelet ψ is generated from the filters h and g, it is not surprising that these filters play an important role in the fast wavelet transform algorithm, which computes a function's wavelet coefficients.

The fast wavelet transform. Wavelet, or decomposition, coefficients can be computed by a cascade of discrete convolutions with the filters h and g. This is based on the following theorem of Mallat [64, Thm. 7.7]:

Theorem 2.1. At the decomposition,

    φj+1,s = Σ_t h[t − 2s] φj,t,    (2.3)
    ψj+1,s = Σ_t g[t − 2s] φj,t.    (2.4)

At the reconstruction,

    φj,s = Σ_t h[s − 2t] φj+1,t + Σ_t g[s − 2t] ψj+1,t.    (2.5)

Taking the inner products of f with both sides of (2.3) and (2.4) yields ⟨f, φj,s⟩ and ⟨f, ψj,s⟩, respectively. When working in the vector space R^n, we can define φ0,s[i] = δs[i]; i.e., the characteristic vector which is 1 at s and zero everywhere else. This implies that V0 = R^n, and that we start the multiresolution approximation at level j = 1. Further, it implies that there are at most log n levels in the multiresolution approximation.

If the filter h, which characterizes the system, has the minimum possible support size of 2, then the support of g is also of size 2, and each basis vector ψj+1,s is a linear combination of two scaling vectors at level j, as can be seen in equation (2.4). This process generates n − 1 basis vectors. Since Vj−1 = Wj ⊕ Vj, we have R^n = V0 = (⊕_{j=1}^{log n} Wj) ⊕ V_{log n}. Hence, φ_{log n,0} is the orthogonal complement to the n − 1 basis vectors ψj,s. If h has 2q nonzero coefficients, then ψj,s is defined on at most (2q − 1)2^j − 2(q − 1) points, and by a similar reasoning, we have the following two propositions.

Proposition 2.2. A compactly-supported wavelet whose filter has 2q non-zero coefficients generates a basis for R^n that has O(q log n) basis vectors with a non-zero value at any point i ∈ [n].

Proposition 2.3. Given i, ψj,s[i] and φj,s[i] can be computed in O(q log n) time.

The Cascade Algorithm for computing ⟨f, φj,s⟩, ⟨f, ψj,s⟩. Given a function f ∈ R^n, we set a0[i] = f[i], and repeatedly compute aj+1[t] = Σ_s h[s − 2t] aj[s] and dj+1[t] = Σ_s g[s − 2t] aj[s]. Notice that if the filter h has support {0, …, 2q − 1}, then we have 0 ≤ s − 2t ≤ 2q − 1. By taking the dot product of f with both sides of (2.3) and (2.4), it is easy to see that aj[t] = ⟨f, φj,t⟩ and dj[t] = ⟨f, ψj,t⟩. In order to compute the inverse transform, we evaluate aj[t] = Σ_s h[t − 2s] aj+1[s] + Σ_s g[t − 2s] dj+1[s]. Observe that by setting a single aj[s] or dj[s] to 1 and the rest to 0, the inverse transform gives us φj,s or ψj,s. Indeed, this is the algorithm usually used to compute φj,s and ψj,s.

Basis vectors vs. continuous wavelets. As can be seen from the cascade algorithm, when working in R^n all computations rely on the discrete conjugate mirror filters h and g. At higher levels j of the multiresolution approximation, the discrete wavelet vectors ψj,s[] and scaling vectors φj,s[] computed in equations (2.4) and (2.3) are close to a uniform sampling of their continuous analogs ψj,s() and φj,s() given in (2.1) and (2.2) [64, pp. 264–5]. Even at level j = 3, ψj,s[i] and φj,s[i] are close to their limit values ψj,s(i) and φj,s(i). As an example, Figure 2.1 depicts the basis and scaling vectors at levels 3, 6, and 12 for the Daubechies D2 wavelet (cf. Example II below). Indeed, we will utilize this close resemblance between the vectors and the corresponding continuous-time wavelets and scaling functions in Lemma 2.6.

Notation. Since we will be working in R^n, we will only be concerned with discrete wavelet and scaling vectors ψj,s, φj,s ∈ R^n. We will index these vectors with, for example, ψj,s(i) rather than ψj,s[i], unless we need to make a distinction between the vectors and their continuous analogs. For ease of notation, we will use both ψk and ψj,s depending on the context, and we will assume that there is a consistent map between them.
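To make the cascade recurrences concrete, here is a minimal sketch in Python/NumPy for the Haar case (q = 1, h = [1/√2, 1/√2], g = [1/√2, −1/√2]). The function names and the restriction to the orthonormal Haar filters are my own illustration choices and not part of the original text.

    import numpy as np

    H = np.array([1.0, 1.0]) / np.sqrt(2.0)   # low-pass filter h
    G = np.array([1.0, -1.0]) / np.sqrt(2.0)  # high-pass filter g

    def haar_cascade(f):
        """One full decomposition: returns the detail coefficients d_j for
        j = 1..log n and the final average (n must be a power of 2)."""
        a = np.asarray(f, dtype=float)
        details = []
        while len(a) > 1:
            # a_{j+1}[t] = sum_s h[s-2t] a_j[s],  d_{j+1}[t] = sum_s g[s-2t] a_j[s]
            a, d = H[0] * a[0::2] + H[1] * a[1::2], G[0] * a[0::2] + G[1] * a[1::2]
            details.append(d)
        return details, a

    def haar_inverse(details, a):
        """Inverse: a_j[t] = sum_s h[t-2s] a_{j+1}[s] + sum_s g[t-2s] d_{j+1}[s]."""
        a = np.asarray(a, dtype=float)
        for d in reversed(details):
            out = np.empty(2 * len(a))
            out[0::2] = H[0] * a + G[0] * d
            out[1::2] = H[1] * a + G[1] * d
            a = out
        return a

    f = np.array([1.0, 4.0, 5.0, 6.0])
    details, avg = haar_cascade(f)
    print(np.allclose(haar_inverse(details, avg), f))  # True

Setting a single detail or average entry to 1 and running haar_inverse reproduces the corresponding discrete basis vector, exactly as described above.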


Figure 2.1: The scaling φj,s [] and wavelet ψj,s [] vectors of the Daubechies D2 wavelet. At level (or scale) j = 3, they are already close to convergence.

Example I: The Haar wavelet. In this case, the number of vanishing moments is q = 1, and h[] = {1/√2, 1/√2}. Thus g[] = {1/√2, −1/√2}. The algorithm to compute the transform for a given vector f computes the "difference" coefficients d1[i] = (f(2i) − f(2i+1))/√2. The "averages" (f(2i) + f(2i+1))/√2 correspond to the a1[i]'s. The entire process is repeated on the a1[i]'s but with n := n/2 (since their number is half of that of the components of f). In the inverse transform we get, for example, a0[0] = (a1[0] + d1[0])/√2 = ((f(0) + f(1))/√2 + (f(0) − f(1))/√2)/√2 = f(0), as expected. The coefficients naturally define a coefficient tree where the root is a_{log n+1}[0] (the overall average scaled by √n) with a single child d_{log n}[0] (the scaled difference of the averages of the left and right halves). Underneath d_{log n}[0] lies a complete binary tree. The total number of nodes is 1 + 2^{log n} − 1 = n. For example, when n = 8 the basis vectors are shown in Figure 2.2. The scaling function is φ(x) = 1 for 0 ≤ x < 1, and 0 otherwise; and the mother wavelet is ψ(t) = 1 if 0 ≤ t < 1/2, ψ(t) = −1 for 1/2 ≤ t < 1, and 0 otherwise. Note that ψ is discontinuous. Thus, synopses or approximations using Haar wavelets are better suited to handle 'jumps' or discontinuities in data. This simple wavelet, proposed in 1910 [53], is still useful in a wide number of database applications [67, 87].

ψ1,0 = ( 1/√2, −1/√2, 0, 0, 0, 0, 0, 0 )
ψ1,1 = ( 0, 0, 1/√2, −1/√2, 0, 0, 0, 0 )
ψ1,2 = ( 0, 0, 0, 0, 1/√2, −1/√2, 0, 0 )
ψ1,3 = ( 0, 0, 0, 0, 0, 0, 1/√2, −1/√2 )
ψ2,0 = ( 1/2, 1/2, −1/2, −1/2, 0, 0, 0, 0 )
ψ2,1 = ( 0, 0, 0, 0, 1/2, 1/2, −1/2, −1/2 )
ψ3,0 = ( 1/√8, 1/√8, 1/√8, 1/√8, −1/√8, −1/√8, −1/√8, −1/√8 )
ψ4,0 = ( 1/√8, 1/√8, 1/√8, 1/√8, 1/√8, 1/√8, 1/√8, 1/√8 )

Figure 2.2: The Haar basis vectors when n = 8 (originally arranged in columns; listed here one vector per row).

A natural question to raise, however, is whether there are 'smoother' wavelets, and the seminal work of Daubechies gives us various examples [21].

Example II: The D2 Daubechies wavelet. The Daubechies Dq wavelets form a family of compactly-supported wavelets indexed by the wavelet's number of vanishing moments. In the case of the D2 wavelet, the number of vanishing moments is q = 2, and

    h[] = { (1+√3)/(4√2), (3+√3)/(4√2), (3−√3)/(4√2), (1−√3)/(4√2) }.

Thus g[] = {h[3], −h[2], h[1], −h[0]}. Examples of the generated scaling and wavelet vectors are shown in Figure 2.1 (normalized to the domain [0, 1]), and, as can be seen, they converge quite rapidly. The coefficients now form a graph, which is given in Figure 2.3, rather than a tree. The Daubechies wavelets are optimal in the sense that they have a support of minimum size for a given number of vanishing moments [20]. However, they are asymmetric. Daubechies also shows that the Haar wavelets are the unique real symmetric compactly-supported wavelets. To ameliorate the asymmetry (and other) issues in the real domain, several proposals have been made (e.g., Symmlets [20] and Coiflets [22]). Our results are applicable as long as the wavelets are compactly-supported.
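As an illustrative check (my own addition, not from the original text), the D2 filter above can be verified numerically against the conjugate-mirror-filter properties quoted earlier: Σ_k h[k] = √2, Σ_k g[k] = 0, unit norm, and orthogonality to its even shifts.

    import numpy as np

    s3, s2 = np.sqrt(3.0), np.sqrt(2.0)
    h = np.array([1 + s3, 3 + s3, 3 - s3, 1 - s3]) / (4 * s2)  # D2 low-pass filter
    g = np.array([h[3], -h[2], h[1], -h[0]])                   # finite-support form of g[k] = (-1)^k h[1-k]

    print(np.isclose(h.sum(), s2))                # True: sum_k h[k] = sqrt(2)
    print(np.isclose(g.sum(), 0.0))               # True: sum_k g[k] = 0
    print(np.isclose(np.dot(h, h), 1.0))          # True: unit norm
    print(np.isclose(np.dot(h[2:], h[:2]), 0.0))  # True: orthogonal to its shift by 2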

Figure 2.3: The coefficient graph of D2 . The dotted lines “wrap around” to the start of the row.

2.4 Greedy Approximation Algorithms for General Compact Systems and Data Streams

Recall our optimization problem: Given a set of basis functions {ψi} and a target function f specified as a vector, we wish to find {zi} with at most B non-zero numbers to minimize ‖f − Σi zi ψi‖_p. We present two analyses below, corresponding to `∞ and `p errors when p ∈ [1, ∞). In each case we begin by analyzing the sufficient conditions that guarantee the error. A (modified) greedy coefficient-retention algorithm will naturally fall out of both analyses. The proof shows that several of the algorithms that are used in practice have a bounded approximation guarantee. Note that the optimum solution can choose any values in the representation f̂. In what follows, the pair (p, p′) are the usual conjugates; i.e., 1/p + 1/p′ = 1 when 1 < p < ∞, and when p = 1 we simply set p′ = ∞. For simplicity, we start with the p = ∞ case.

2.4.1 An `∞ Algorithm and Analysis

The main lemma, which gives us a lower bound on the optimal error, is:

Lemma 2.4. Let E be the minimum error under the `∞ norm and {zi*} be the optimal solution. Then

    −‖ψi‖_1 |E| ≤ ⟨f, ψi⟩ − zi* ≤ ‖ψi‖_1 |E|.

Proof. For all j we have −|E| ≤ f(j) − Σi zi* ψi(j) ≤ |E|. Since the equation is symmetric, multiplying it by ψk(j) we get

    −|E| |ψk(j)| ≤ f(j)ψk(j) − ψk(j) Σi zi* ψi(j) ≤ |E| |ψk(j)|.

If we add the above equation for all j, since −|E| Σj |ψk(j)| = −|E| ‖ψk‖_1, we obtain (considering only the left side)

    −|E| ‖ψk‖_1 ≤ Σj f(j)ψk(j) − Σj ψk(j) Σi zi* ψi(j)
                = ⟨f, ψk⟩ − Σi zi* Σj ψk(j)ψi(j)
                = ⟨f, ψk⟩ − Σi zi* δik = ⟨f, ψk⟩ − zk*.

The upper bound follows analogously.

A Relaxation. Consider the following program:

    minimize τ
    subject to  −τ ‖ψi‖_1 ≤ ⟨f, ψi⟩ − zi ≤ τ ‖ψi‖_1   for i = 1, …, n,    (2.6)
                at most B of the zi's are nonzero.

Observe that E is a feasible solution of the above program, and so E ≥ τ. Also, Lemma 2.4 is not specific to wavelet bases, and indeed we have E = τ when {ψi} is the standard basis, i.e., ψi is the vector with 1 in the ith coordinate and 0 elsewhere. The next lemma is straightforward.

Lemma 2.5. The minimum τ of program (2.6) is the (B + 1)th largest value |⟨f, ψi⟩| / ‖ψi‖_1.

The Algorithm. We choose the largest B coefficients based on |⟨f, ψi⟩| / ‖ψi‖_1. This can be done over a one-pass stream, and in O(B + log n) space, for any compact wavelet basis. Note that we need not choose zi = ⟨f, ψi⟩, but any zi such that |zi − ⟨f, ψi⟩| ≤ τ ‖ψi‖_1; in particular, we may choose to retain coefficients and set zi = ⟨f, ψi⟩. The alternate choices may (and often will) be better. Also note that the above is only a necessary condition; we still need to analyze the guarantee provided by the algorithm.

Lemma 2.6. For all basis vectors ψi of a compact system there exists a constant C s.t. ‖ψi‖_p ‖ψi‖_{p′} ≤ C√q.

Proof. Suppose first that p < 2. Consider a basis vector ψi[] = ψj,s[] of sufficiently large scale j that has converged to within a constant r (point-wise) of its continuous analog ψj,s() [64] (cf. Section 2.3). That is, |ψj,s[k] − ψj,s(k)| ≤ r for all k such that ψj,s[k] ≠ 0. The continuous function ψj,s() is given by ψj,s(t) = 2^{−j/2} ψ(2^{−j}t − s), which implies ψj,s[k] = O(2^{−j/2} ψ(2^{−j}k − s)) = O(2^{−j/2}). Note that we are assuming ‖ψ‖_∞ itself is some constant, since it is independent of n and B. Combining the above with the fact that ψj,s[] has at most (2q)2^j non-zero coefficients, we have

    ‖ψj,s‖_{p′} = O(2^{−j/2} ((2q)2^j)^{1/p′}) = O(2^{j(1/p′ − 1/2)} (2q)^{1/p′}).

Now by Hölder's inequality,

    ‖ψj,s‖_p ≤ ((2q)2^j)^{1/p − 1/2} ‖ψj,s‖_2 = 2^{j(1/p − 1/2)} (2q)^{1/p − 1/2}.

Therefore, for sufficiently large scales j,

    ‖ψj,s‖_p ‖ψj,s‖_{p′} = O(2^{j(1/p + 1/p′ − 1)} (2q)^{1/p + 1/p′ − 1/2}) = O(√q),

and the lemma holds. For basis vectors at smaller (constant) scales, since the number of non-zero entries is constant, the `p norm and the `p′ norm are both constant. Finally, for p > 2, the argument holds by symmetry.

Theorem 2.7. The `∞ error of the final approximation is at most O(q^{3/2} log n) times E for any compactly supported wavelet.

Proof. Let {zi} be the solution of the system (2.6), and let S be the set of the inner products chosen. Let τ* be the minimum solution of the system (2.6). The `∞ error seen at a point j is |Σ_{i∉S} ⟨f, ψi⟩ψi(j)| ≤ Σ_{i∉S} |⟨f, ψi⟩| |ψi(j)|. By Lemma 2.5, this sum is at most Σ_{i∉S} τ* ‖ψi‖_1 |ψi(j)|, which is at most τ* max_{i∉S} ‖ψi‖_1 ‖ψi‖_∞ times the number of vectors that are non-zero at j. By Proposition 2.2 the number of non-zero vectors at j is O(q log n). By Lemma 2.6, ‖ψi‖_1 ‖ψi‖_∞ ≤ C√q for all i, and we have that the `∞ norm of the error is bounded by O(q^{3/2} τ* log n) ≤ O(q^{3/2} E log n), since τ* ≤ E.
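The selection rule analyzed above only needs the B largest values of |⟨f, ψi⟩|/‖ψi‖_1, so a one-pass implementation can keep a size-B min-heap keyed by that ratio as coefficients are produced (e.g., by the cascade algorithm). The sketch below is an illustrative Python fragment under that assumption; it is not the dissertation's implementation, and it takes the stream of (index, coefficient, ‖ψi‖_1) triples as given.

    import heapq

    def greedy_retain(coeff_stream, B):
        """coeff_stream yields (i, <f, psi_i>, l1_norm_of_psi_i).
        Returns the retained {i: coefficient} under the l_infinity rule."""
        heap = []  # min-heap of (|coef| / ||psi_i||_1, i, coef)
        for i, coef, l1 in coeff_stream:
            key = abs(coef) / l1
            if len(heap) < B:
                heapq.heappush(heap, (key, i, coef))
            elif key > heap[0][0]:
                heapq.heapreplace(heap, (key, i, coef))
        return {i: coef for _, i, coef in heap}

For the `p version of Section 2.4.2 one would key the heap by |⟨f, ψi⟩|/‖ψi‖_{p′} instead; the space used is O(B) heap entries plus the O(log n) state of the cascade.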

2.4.2 An `p Algorithm and Analysis for p ∈ [1, ∞)

Under the `p norm, a slight modification to the algorithm above also gives us an O(q^{3/2} log n) approximation guarantee.

Lemma 2.8. Let E be the minimum error under the `p norm and {zi*} be the optimal solution. Then, for some constant c0,

    ( Σ_k (1/‖ψk‖_{p′}^p) |⟨f, ψk⟩ − zk*|^p )^{1/p} ≤ (c0 q log n)^{1/p} E.

Proof. An argument similar to that of Lemma 2.4 gives

    |⟨f, ψk⟩ − zk*| = | Σ_i f(i)ψk(i) − Σ_i Σ_j zj* ψj(i)ψk(i) | ≤ Σ_i ξi |ψk(i)| ≤ ( Σ_{i ∈ support of ψk} ξi^p )^{1/p} ‖ψk‖_{p′},

where ξi = |f(i) − Σ_j zj* ψj(i)| and the last step is Hölder's inequality. Hence,

    (1/‖ψk‖_{p′}^p) |⟨f, ψk⟩ − zk*|^p ≤ Σ_{i ∈ support of ψk} ξi^p
    ⇒ Σ_k (1/‖ψk‖_{p′}^p) |⟨f, ψk⟩ − zk*|^p ≤ c0 q log n Σ_i ξi^p,

where the last inequality follows from Proposition 2.2, that each i belongs to O(q log n) basis vectors; the constant c0 is the constant in the O, and it is ≤ 4. Since Σ_i ξi^p = E^p, the lemma follows.

A Relaxation. Consider the following system of equations:

    minimize τ
    subject to  ( Σ_{i=1}^{n} |⟨f, ψi⟩ − zi|^p / ‖ψi‖_{p′}^p )^{1/p} ≤ (c0 q log n)^{1/p} τ,    (2.7)
                at most B of the zi's are nonzero.

The Algorithm. We choose the largest B coefficients based on |⟨f, ψk⟩| / ‖ψk‖_{p′}, which minimizes the system (2.7). This computation can be done over a one-pass stream, and in O(B + log n) space.

Theorem 2.9. Choosing the B coefficients ⟨f, ψk⟩ that are largest based on the ordering |⟨f, ψk⟩| / ‖ψk‖_{p′} is a streaming O(q^{3/2} log n) approximation algorithm for the unrestricted optimization problem under the `p norm. Note this matches the `∞ bounds, but stores a (possibly) different set of coefficients.

Proof. Let the value of the minimum solution to the above system of equations (2.7) be τ*. Since {zi*} is feasible for system (2.7), τ* ≤ E. Assume S is the set of coefficients chosen; the resulting error ES satisfies

    ES^p = Σ_i | Σ_{k∉S} ⟨f, ψk⟩ψk(i) |^p
         ≤ (c0 q log n)^{p−1} Σ_i Σ_{k∉S} |⟨f, ψk⟩|^p |ψk(i)|^p
         = (c0 q log n)^{p−1} Σ_{k∉S} |⟨f, ψk⟩|^p ‖ψk‖_p^p
         ≤ (c0 q log n)^{p−1} C q^{p/2} Σ_{k∉S} |⟨f, ψk⟩|^p / ‖ψk‖_{p′}^p
         = C q^{p/2} (τ* c0 q log n)^p.

Here, the first inequality is Hölder's inequality combined with Proposition 2.2 and the fact that p/p′ = p − 1; the second inequality follows from Lemma 2.6; and the final equality follows from the optimality of our choice of coefficients for the system (2.7). Now since τ* ≤ E, we have that ES ≤ c0 C q^{3/2} E log n.

2.4.3 Summary and Examples

In the two preceding subsections we showed the following:

Theorem 2.10. Let 1/p + 1/p′ = 1. Choosing the largest B coefficients based on the ordering |⟨f, ψi⟩| / ‖ψi‖_{p′}, which is possible by a streaming O(B + log n) space algorithm, gives a O(q^{3/2} log n) approximation algorithm for the unrestricted optimization problem under the `p norm.

The argument naturally extends to multiple dimensions. As is well known, this choice of coefficients is optimal when p = 2 (since p′ = 2 and ‖ψi‖_2 = 1). Note that the above theorem bounds the gap between the restricted (we can only choose wavelet coefficients of the input in the representation) and unrestricted optimizations.
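For the Haar system the ordering key of Theorem 2.10 has a closed form: a Haar vector ψj,s has 2^j nonzero entries of magnitude 2^{−j/2}, so ‖ψj,s‖_{p′} = 2^{j(1/p′ − 1/2)}. The helper below (an illustrative sketch, not code from the dissertation) computes the key and shows that p = 2 recovers plain magnitude ordering.

    import math

    def haar_key(coef, level, p):
        """Scaled ordering key |<f, psi_{j,s}>| / ||psi_{j,s}||_{p'} for the
        orthonormal Haar vector at scale j = level (2^j nonzeros of size 2^{-j/2})."""
        if p == 1:
            inv_p_conj = 0.0                # p' = infinity
        elif math.isinf(p):
            inv_p_conj = 1.0                # p' = 1
        else:
            inv_p_conj = 1.0 - 1.0 / p      # 1/p' = 1 - 1/p
        norm_p_conj = 2.0 ** (level * (inv_p_conj - 0.5))
        return abs(coef) / norm_p_conj

    print(haar_key(3.0, level=4, p=2))             # 3.0  (p' = 2, unit norm: largest-B rule)
    print(haar_key(3.0, level=4, p=float('inf')))  # 0.75 (p' = 1: coarse scales are discounted)
    print(haar_key(3.0, level=4, p=1))             # 12.0 (p' = infinity: coarse scales are favored)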

A tight example for the `∞ measure. Suppose we are given the Haar basis {ψi} and a vector f with the top coefficient ⟨f, ψ1⟩ = 0, with ⟨f, ψi⟩/‖ψi‖_1 = 1 − ε for i ≤ n/2, and with ⟨f, ψi⟩/‖ψi‖_1 = 1 for i > n/2 (where the ψi, i > n/2, are the basis vectors with smallest support). Let B = n/c − 1, where c ≥ 2 is a constant that is a power of 2. The optimal solution can choose the B coefficients which are in the top log n − log c levels, resulting in an error bounded by log c. The `∞ error of the greedy strategy, on the other hand, will be at least log n − 1, because it will store coefficients only at the bottom of the tree. Hence its error is at least log n / log c − o(1) times that of the optimal.

2.5 Adaptive Quantization

Wavelets are extensively used in the compression of images and audio signals. In these applications a small percentage saving of space is considered important, and attention is paid to the bits being stored. The techniques employed are heavily engineered and typically designed by some domain expert. The complexity is usually two-fold. First, the numbers zi do not all cost the same to represent. In some strategies, e.g., strategies used for audio signals, the number of bits of precision used to represent a coefficient zi corresponding to the basis vector ψi = ψj,s is fixed, and it typically depends only on the scale j. (Recall that there is a mapping from ψi to ψj,s.) Further, the aj[]'s are computed with a higher precision than the dj[]'s. This affects the space needed by the top-most coefficients. In yet another strategy, which is standard to a broad compression literature, it is assumed that log2 z bits are required to represent a number z. All of these bit-counting techniques need to assume that the signal is bounded and that there is some reference unit of precision. Second, in several systems, e.g., in JPEG2000 [14], a bitmap is used to indicate the non-zero entries. However, the bitmap requires O(n) space, and it is often preferred that we store only the status of the non-zero values instead of the status of all values in the transform. In a setting where we are restricted to o(n) space, as in the streaming setting, the space efficiency of the map between non-zero coefficients and locations becomes important. For example, we can represent ψi = ψj,s using log log n + log(n/2^j) + O(1) bits instead of log n bits to specify i. Supposing that only the vectors with support of √n or larger are important for a particular signal, we will then end up using half the number of bits. Notice that this encoding method increases the number of bits required for storing a coefficient at a small scale j to more than log n. This increase is (hopefully) mitigated by savings at larger scales. Note also that the wavelet coefficients at the same level are treated similarly.

The techniques we presented in Section 2.4 naturally extend to these variants of the bit-budget problem. In what follows, we consider three specific cases:

1. Spectrum Representations: The cost ci of storing a coefficient corresponding to i is fixed. This case includes the suggested strategy of using log log n + log(n/2^j) + O(1) bits.

2. Bit Complexity Representations: The cost of storing the ith coefficient with value zi is ci + b(zi) for some (concave) function b(). A natural candidate for b() is b(z) = O(1) − log zi^frac, where zi^frac is the fractional part of zi and is less than 1 (thus −log zi^frac is positive). This encodes the idea that we can store a higher "resolution" at a greater cost.

3. Multiplane Representations: Here the data conceptually consists of several "planes", and the cost of storing the ith coefficient in one plane depends on whether the ith coefficient in another plane is retained. For example, suppose we are trying to represent an RGBA image, which has four attributes per pixel. Instead of regarding the data as 4 × 2 dimensional, it may be more useful, for example if the variations in color are non-uniform, to treat the data as being composed of several separate planes, and to construct an optimization that allocates the bits across them.

The fundamental method by which we obtain our approximate solutions to the above three problems is to use a greedy rule to lower bound the errors of the optimal solutions using systems of constraints, as we did in Section 2.4. We focus only on the `∞ error for ease of presentation. As before, the techniques we use imply analogous results for `p norms.

2.5.1 Spectrum Representations

In the case where the cost of storing a number for i is a fixed quantity ci, we obtain a lower bound via a quadratic program that is similar to (2.6) using Lemma 2.4. That is, minimize τ with the constraints xi ∈ {0, 1} and Σi xi ci ≤ B, and, for all i,

    −τ ‖ψi‖_1 ≤ ⟨f, ψi⟩ − xi zi ≤ τ ‖ψi‖_1.    (2.8)

The program above can be solved optimally since the ci's are polynomially bounded. We sort the coefficients in non-increasing order of yi := |⟨f, ψi⟩| / ‖ψi‖_1. If y_{i1} ≥ y_{i2} ≥ … ≥ y_{in}, then we include coefficients i1, …, ik, where Σ_{j=1}^{k} c_{ij} ≤ B < Σ_{j=1}^{k+1} c_{ij}. The value y_{i_{k+1}} is then a lower bound on the error E of the optimal representation z*. Note that z* is a feasible solution to program (2.8). Hence, either z* includes coefficients i1, …, ik, in which case it cannot choose coefficient i_{k+1}, for it would exceed the space bound B, and we have that E ≥ y_{i_{k+1}} (the optimal does not necessarily set zi = ⟨f, ψi⟩); or z* does not include one of i1, …, ik, and thus E is again greater than or equal to y_{i_{k+1}}. A proof similar to that of Theorem 2.7 shows that the error of our solution is O(log n)E.
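A compact sketch of this prefix-greedy rule (illustrative Python; the inputs, per-coefficient ratios yi and fixed bit costs ci, are assumed to be precomputed, and this is not the dissertation's code):

    def spectrum_greedy(items, bit_budget):
        """items: list of (i, y_i, c_i) with y_i = |<f, psi_i>| / ||psi_i||_1 and
        c_i the fixed bit cost of storing coefficient i.
        Returns (chosen indices, lower bound on the optimal error)."""
        items = sorted(items, key=lambda t: t[1], reverse=True)  # non-increasing y_i
        chosen, used = [], 0
        for i, y, c in items:
            if used + c > bit_budget:
                # first coefficient that does not fit: its y value lower-bounds OPT
                return chosen, y
            chosen.append(i)
            used += c
        return chosen, 0.0  # everything fits; the representation can be exact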

2.5.2 Bit Complexity Representations

In the case where the cost depends on zi, we cannot write an explicit system of equations as we did in the case of spectrum representations. However, we can guess τ up to a factor of 2 and verify if the guess is correct. In order to verify the guess, we need to be able to solve equations of the form min_z b(z) s.t. |a − z| ≤ t (since this is the format of our constraints). This minimization is solvable for most reasonable cost models; e.g., if b(z) is monotonically increasing. As the coefficients are generated, we compute ci + b(zi) if zi ≠ 0, where zi = argmin_z b(z) s.t. |⟨f, ψi⟩ − z| ≤ t ‖ψi‖_1 for our guess t of the error. If we exceed the allotted space B at any point during the computation, we know that our guess t is too small, and we start the execution over with the guess 2t. Note that the optimal representation is a feasible solution with value E and bit complexity B. Applying the analysis of Section 2.4.1 shows that the first solution we obtain that respects our guess is a O(log n) approximation to the optimal representation. Since we assume that the error E is polynomially bounded, the above strategy can be made to stream by running O(log n) greedy algorithms in parallel, each with a different guess of τ as above.
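The guess-and-double loop can be written down directly. The sketch below is an offline illustration only; the helpers cheapest_z (solving min_z b(z) subject to |a − z| ≤ t) and the cost model b are hypothetical names introduced for this example and are not from the dissertation.

    def bit_complexity_greedy(coeffs, b, cheapest_z, bit_budget, t0):
        """coeffs: list of (c_i, a_i, l1_i) with a_i = <f, psi_i> and l1_i = ||psi_i||_1.
        Doubles the error guess t until the induced representation fits in bit_budget."""
        t = t0
        while True:
            used, solution = 0, {}
            for i, (c_i, a_i, l1_i) in enumerate(coeffs):
                z = cheapest_z(a_i, t * l1_i, b)   # argmin_z b(z) s.t. |a_i - z| <= t * l1_i
                if z != 0:
                    used += c_i + b(z)
                    solution[i] = z
                if used > bit_budget:
                    break                          # guess too small
            else:
                return t, solution                 # first feasible guess: O(log n)-approximation
            t *= 2

In the streaming version described above, the O(log n) guesses are run in parallel rather than restarted sequentially.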

2.5.3 Multiplane Representations

In this case we are seeking to represent data that is conceptually in several "planes" simultaneously; e.g., RGBA in images. We could also conceptualize images of the same object at various frequencies or technologies. The goal of the optimization is to allocate the bits across them. However, notice that if we choose the ith coefficient for, say, the Red and the Blue planes (assuming that we are indicating the presence or absence of a coefficient explicitly, which is the case for a sparse representation), then we can save space by storing the fact that "coefficient i is chosen" only once. This is easily achieved by keeping a vector of four bits corresponding to each chosen coefficient. The values of the entries in the bit vector inform us whether the respective coefficient value is present. Therefore, the bit vector 1010 would indicate that the next two values in the data correspond to Red and Blue values of a chosen coefficient. Similarly, a vector 1011 would suggest that three values corresponding to Red, Blue and Alpha are to be expected.

In what follows, we assume that the data is D dimensional and is comprised of t planes (in the RGBA example D = 2 and t = 4). We are constrained to storing at most B bits total for the bit vectors, the indices of the chosen coefficients, and the values of these coefficients. For simplicity we assume that we are using the `∞ error across all the planes. Otherwise, we would also have to consider how the errors across the different planes are combined.

We construct our approximate solution by first sorting the coefficients of the t planes in a single non-increasing order while keeping track of the plane to which each coefficient belongs. As before, we add the coefficients that are largest in this ordering to our solution, and stop immediately before the coefficient whose addition results in exceeding the allotted space B. Note that if we had added the ith coefficient of the Red plane first, and thereafter wanted to include the Blue plane's ith coefficient, then we need only account for the space of storing the index i and the associated bit vector when we add the coefficient for the first (in this case Red) plane. The subsequent ith coefficients only contribute the cost of storing their values to the solution. (We can think of the cost of storing each coefficient as fixed after the ordering of the coefficients is determined.) This strategy is reminiscent of the strategy used by Guha, Kim and Shim [48] to lower bound the optimum error for a similar problem in the `2 setting. The first coefficient we did not choose using this greedy selection process is a lower bound on the optimal representation error. Now, an argument similar to that of Theorem 2.7 shows that the error of the resulting solution is a O(log n) factor away from the error of the optimal solution.
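The shared-index accounting can be captured by a small cost function. The sketch below is illustrative only; the per-item bit counts index_bits and value_bits, and the function name, are assumptions for this example. It computes the marginal cost of adding a (plane, index) pair to an already-chosen set, matching the rule that the index and the t-bit presence vector are paid for only by the first plane that uses that index.

    def marginal_cost(chosen, plane, index, t, index_bits, value_bits):
        """chosen: set of (plane, index) pairs already in the solution.
        Returns the extra bits needed to also store coefficient `index` of `plane`."""
        first_use = not any(idx == index for _, idx in chosen)
        cost = value_bits                      # the coefficient value itself
        if first_use:
            cost += index_bits + t             # index plus t-bit presence vector (e.g., 1010)
        return cost

    # Example: t = 4 planes, a 16-bit index, 8-bit values.
    chosen = {(0, 7)}                          # Red coefficient at index 7 already stored
    print(marginal_cost(chosen, 2, 7, 4, 16, 8))   # 8  (Blue at the same index: value only)
    print(marginal_cost(chosen, 0, 3, 4, 16, 8))   # 28 (new index: 8 + 16 + 4)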

2.6 Using the Greedy Algorithm for Compressing Images under non-`2 Error Measures

In this section we give two examples that demonstrate uses for our greedy algorithm in compressing images. A non-streaming version of the algorithm for Haar and Daubechies wavelets was implemented in Matlab using the Uvi Wave.300 toolbox [75]. (For compatibility with our version of Matlab, slight modifications to the toolbox were performed; the toolbox can be obtained from http://www.gts.tsc.uvigo.es/~wavelets/.) Pseudocode of the implementation is provided below in Figure 2.4. The algorithm takes four parameters as input: the image X, the number of coefficients to retain B, the p-norm to minimize, and the type of Daubechies wavelet to use. The last parameter, q, determines the number of nonzero coefficients in the wavelet filter (see Section 2.3 and the examples therein). Recall that the Haar wavelet is the Daubechies wavelet with smallest support; i.e., it has q = 1.

Algorithm DaubGreedy(X, B, p, q)
 1. (∗ X is a grayscale image (intensity matrix) ∗)
 2. Perform a 2D wavelet transform of X using the Daubechies Dq wavelet
 3. Let w be the wavelet coefficients of the transform
 4. p′ ← p/(p − 1)
 5. yi ← |wi| / ‖ψi‖_{p′}
 6. Let I be the indices of the B largest yi's
 7. wi ← 0 if i ∉ I
 8. Perform a 2D inverse wavelet transform on the resulting w
 9. Let X′ be the resulting image representation
10. return X′

Figure 2.4: Pseudocode of the greedy algorithm's implementation.
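For readers who prefer runnable code, the following is a rough Python re-creation of DaubGreedy for the Haar case using the PyWavelets package. This is my own sketch, not the dissertation's Matlab implementation: the closed-form norm ‖ψ‖_{p′} = 2^{j(2/p′ − 1)} used below is specific to the 2D Haar basis with periodic boundaries, and a general Dq implementation would need to compute the actual ‖ψi‖_{p′} values (e.g., by inverse-transforming unit coefficient arrays).

    import numpy as np
    import pywt

    def haar_greedy_2d(X, B, p):
        """Keep the B detail coefficients with the largest |w_i| / ||psi_i||_{p'}
        (2D Haar, periodic boundaries), zero the rest, and reconstruct."""
        inv_p_conj = 0.0 if p == 1 else 1.0 - 1.0 / p            # 1/p'
        coeffs = pywt.wavedec2(np.asarray(X, float), 'haar', mode='periodization')
        # coeffs = [cA_J, (cH_J, cV_J, cD_J), ..., (cH_1, cV_1, cD_1)]
        J = len(coeffs) - 1
        scored = []
        for k, bands in enumerate(coeffs[1:], start=1):
            j = J - k + 1                                        # scale of this detail level
            norm = 2.0 ** (j * (2.0 * inv_p_conj - 1.0))         # ||psi||_{p'} for 2D Haar at scale j
            for band in bands:                                   # cH, cV, cD
                for idx, w in np.ndenumerate(band):
                    scored.append((abs(w) / norm, band, idx))
        scored.sort(key=lambda t: t[0], reverse=True)
        for _, band, idx in scored[B:]:                          # drop everything but the top B
            band[idx] = 0.0
        return pywt.waverec2(coeffs, 'haar', mode='periodization')

Unlike the pseudocode, this sketch always keeps the single coarsest approximation coefficient and ranks only the detail coefficients; with a full periodized decomposition that is one coefficient out of n.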

The first example illustrates a use of the `∞ measure for sparse representation using wavelets. Minimizing the maximum error at any point in the reconstructed image implies we should retain the wavelet coefficients that correspond to sharp changes in intensity; i.e., the coefficients that correspond to the "details" in the image. The image we used, shown in Figure 2.5(a), is composed of a gradient background and both Japanese and English texts. (The Japanese text is poem number 89 of the Kokinshu anthology [69]; the translation is by Helen Craig McCullough.) The number of nonzero wavelet coefficients in the original image is 65524. We set B = 3840 and ran Algorithm DaubGreedy with p = 1, 2 and ∞ under the Haar wavelet (with q = 1). When p = 2, the algorithm outputs the optimal B-term representation that minimizes the `2 error measure. That is, the algorithm simply retains the largest B wavelet coefficients (since p′ = 2 and ‖ψi‖_{p′} = 1 for all i). When p = 1 or p = ∞, the algorithm outputs a O(log n)-approximate B-term representation, as explained in Section 2.4. The results are shown in Figure 2.5. Notice that the `∞ representation essentially ignores the gradient in the background, and it retains the wavelet coefficients that correspond to the text in the image. The `1 representation also does better than the `2 representation in terms of rendering the Japanese text; however, the English translation in the former is not as clear. The attribution in the `2 representation, on the other hand, is completely lost. Although the differences between the three representations are not stark, this example shows that under such high compression ratios the `∞ norm is more suitable for capturing signal details than other norms.

The second example illustrates a use of the `1 error measure. Since the `1 norm is robust, in the sense that it is indifferent to outliers, the allocation of wavelet coefficients when minimizing the `1 norm will be less sensitive to large changes in intensity than the allocation under the `2 norm. In other words, under the `1 norm the wavelet coefficients will be allocated more evenly across the image. The image we used, shown in Figure 2.6(a), is a framed black and white matte photograph. The number of nonzero wavelet coefficients in the original image is 65536. We set B = 4096 and ran Algorithm DaubGreedy with p = 1, 2 and ∞ under the Daubechies D2 wavelet. The results are shown in Figure 2.6. Notice that the face of the subject is rendered more "smoothly" in the `1 representation than in the `2 representation. Further, the subject's mouth is not portrayed completely in the `2 representation. As explained earlier, these differences between the two representations are due to the fact that the `1 norm is not as affected as the `2 norm by other conspicuous details in the image, e.g., the frame. The `∞ representation, on the other hand, focuses on the details of the image, displaying parts of the frame and the eyes well, but misses the rest of the subject entirely. This example foregrounds some advantages of the `1 norm over the customary `2 norm for compressing images.


(a) The original image

(b) Output of the optimal `2 algorithm (which retains the largest B wavelet coefficients)

(c) Output of our greedy algorithm under `∞

(d) Output of our greedy algorithm under `1

Figure 2.5: Representing an image with embedded text using the optimal strategy that minimizes the `2 error, and our greedy approximation algorithm under the `∞ and `1 error measures. The Haar wavelet is used in all three representations, and the number of retained coefficients is B = 3840.


(a) The original image

(b) Output of the optimal `2 algorithm (which retains the largest B wavelet coefficients)

(c) Output of our greedy algorithm under `∞

(d) Output of our greedy algorithm under `1

Figure 2.6: Representing an image using the optimal strategy that minimizes the `2 error, and our greedy approximation algorithm under the `∞ and `1 error measures. The Daubechies D2 wavelet is used in all three representations, and the number of retained coefficients is B = 4096.


Chapter 3

Approximation Schemes for Haar Sparse `p Regression

3.1 Background

3.1.1 The Sparse Representation Problem using the Haar Wavelet

In the previous chapter, we addressed the sparse wavelet representation problem when the given wavelet is compactly-supported. Recall that in this problem, we are asked to represent a target vector f ∈ R^n using at most B basis vectors from a given compactly-supported wavelet basis Ψ = {ψi} of R^n. We presented a greedy algorithm that retains wavelet expansion coefficients of f and results in a solution whose error is a O(log n) factor away from that of the optimal B-term representation of f using Ψ, even though the optimal may store any B real numbers. In this chapter, we focus on the Haar wavelet basis. We develop a dynamic programming algorithm that does not restrict itself to retaining wavelet coefficients of f, and produces a solution whose `p error is arbitrarily close to that of the optimal solution.

In related work, Gibbons and Garofalakis [39] also analyze the Haar case only and store numbers other than wavelet coefficients. They propose a probabilistic framework and compare their solution to solutions that retain wavelet expansion coefficients of the target vector. However, the quality of their solution is not compared to the unrestricted optimal B-term synopsis. Further, Matias and Urieli [66] show that for the Haar system and the weighted `2 norm, if the weights are known in advance, then we can "re-weight" the Haar basis to generate an analogous basis for the Hilbert space equipped with the weighted `2 norm. However, finding a solution that minimizes a general (including weighted) `p norm, 1 ≤ p ≤ ∞, is an important optimization problem which had heretofore remained open.

A central issue in this regard is the complexity of representing the optimal solution. A priori, it is not immediate that the optimal unrestricted solution minimizing, say, the `5 norm for a function that takes only rational values can be specified by B rational numbers. A closely related question is the computational complexity of finding the optimal solution. Can the solution be found in time polynomial in the size of the input? Or, better yet, can the solution be found in strongly polynomial time, where the running time of the algorithm does not depend on the numeric values of the input? The natural question that follows is: can we approximate the optimal representation by a solution with B terms whose precisions are bounded, and find this bounded-precision solution fast? This is a classical question from the perspective of approximation algorithms, where we seek to find a solution that is close to the optimum. Note that the use of approximation algorithms does not limit us from using additional heuristics that may be beneficial, but gives us a more organized starting point with provable guarantees that allows us to develop finer heuristics (cf. Section 3.4.4). We show that for a variety of representation problems under the Haar basis, given any ε > 0 we can find a solution whose error is at most (1 + ε) times the best possible error. This provides evidence that these problems are not hard to approximate up to constant factors, i.e., not Max–SNP–Hard (see, e.g., Vazirani [86]), and that the only difficulty in optimization arises from the precisions of the numbers involved.

3.1.2 Sparse Representation with Best Basis Selection

One natural generalization of our problem incorporates a choice of basis into the optimization [25]. Here, we are given a dictionary D of bases, and our objective is to choose a best basis in D for representing f using B terms. This bi-criteria optimization problem is a form of highly nonlinear approximation [25]. In a seminal work, Coiffman and Wickerhauser [17] construct a binary tree-structured dictionary composed of O(n log n) vectors and containing 2^{O(n/2)} orthonormal bases. They present a dynamic programming algorithm that, in O(n log n) time, finds a best basis minimizing the entropy of its inner products with the given function f. Mallat [64] discusses generalizations based on their algorithm for finding a basis from the tree dictionary that minimizes an arbitrary concave function of its expansion coefficients. However, a basis in D that minimizes a concave function of its inner products with the given f is not necessarily one with which we can best represent f (in an `p sense) using B terms. Combining our approximation algorithms for the original B-term representation problem with the algorithm of Coiffman and Wickerhauser, we show how one can construct provably-approximate B-term representations in these tree-structured wavelet dictionaries.

Another form of highly nonlinear approximation, and a further generalization, allows for representations using vectors from the dictionary that do not necessarily form a basis. In an arbitrary setting, this problem is NP-hard via a reduction from Exact Cover by 3-Sets (X3C), as shown by Davis, Mallat and Avellaneda [23]. In fact, and as noted by Gilbert, Muthukrishnan, and Strauss [44], the same proof shows that even approximating the error of the optimal B-term representation up to any constant factor is NP-hard. Thus the natural avenue is to investigate more structured dictionaries that arise in practice, and as discussed above we consider the tree-structured dictionaries of Coiffman and Wickerhauser [17]. Our results, therefore, separate these dictionaries and arbitrary dictionaries from the perspective of approximation.

In related work, Gilbert et al. [44] focus on redundant dictionaries with small coherence (that is, the vectors in the dictionary are nearly orthogonal). They construct (1 + ε)-approximate representations under the `2 error measure in any such incoherent dictionary. They list the problem of representation using redundant dictionaries under general `p norms as an open question. Their results, however, do not apply to the tree-structured dictionaries that we consider, since our dictionaries are highly coherent.

3.1.3 Representations for Data Streams

Along with the development of richer representation structures, there has recently been a significant increase in the sizes of the data sets we are faced with. At the current large scales, input data is not expected to fit in the available memory of even fairly powerful computers. One of the emergent paradigms to cope with this challenge is the idea of data stream algorithms (see, e.g., the excellent survey by Muthukrishnan [72]). In a data stream model the input is provided one item at a time, and any input item not explicitly stored is inaccessible to the computation; in other words, it is lost. The challenge is to perform the relevant computation in space that is sublinear in the input size; for example, computing the best representation of a discrete signal f[i], for i ∈ [n], that is presented in increasing order of i in only o(n) space. This is a classic model of time-series data, where the function is presented one value at a time. It is immediate that under this space restriction we may not be able to optimize our representation. Compounded with the issue raised earlier concerning the precision of the solution, the problem of devising approximation algorithms becomes doubly interesting in the streaming context. For the case of `2 error, Gilbert et al. [42] show that the optimal representation can be achieved in a data stream setting. In a significantly stronger model of data streams, where f[i] is specified via a series of updates, Gilbert et al. [41] show how to achieve a (1 + ε) approximation under the `2 error measure. Although both these papers state their results in terms of Haar wavelets, their findings extend to more general compactly-supported wavelets, because the core ideas of both papers provide a way for maintaining (in the more general update model, only approximately) the largest B terms. However, the critical part of both proofs is Parseval's identity, which does not hold under non-`2 error measures. Under general `p norms, 1 ≤ p ≤ ∞, our greedy approximation algorithms, as shown in Chapter 2, and our approximation scheme for the Haar wavelet, which we present here, apply in the data stream model. (The data-streaming version of the Haar approximation scheme was implemented for the `∞ case.)

The principal technique used in this chapter is to lower-bound the optimal solution based on a system of linear equations, but with one non-linear constraint, as we did in Chapter 2. This lower bound serves to set the "scale" or "precision" of the solution we will seek, and we show that the best solution respecting this precision is a near-optimal solution by "rounding" the components of the optimal solution to this precision. Finally, we find the best solution in this class by utilizing another lower bound, which involves the Haar multiresolution scaling vectors, in a dynamic program adapted to the data stream setting.

Organization. We start by summarizing our results in Section 3.2. Section 3.3 is the main section of the chapter, where we detail the fully polynomial-time approximation scheme (FPTAS) for the Haar system; subsequently, in Section 3.4, we show its extensions to multiple dimensions and to workloads. We also demonstrate how the same ideas translate to a quasi-polynomial time approximation scheme (QPTAS) for more general compactly-supported wavelets. In Section 3.5 we present the tree-structured best-basis selection algorithm. Finally, in Section 3.6 we display some experimental results contrasting the performance of an optimal algorithm that is restricted to retaining Haar expansion coefficients with our FPTAS and a heuristic based on it.

3.2 Contributions

1. For the B-term representation problem we show the following:

(a) The unrestricted optimization problem (cf. Section 2.2) has a fully polynomial-time approximation scheme (FPTAS) under all `p error measures in the Haar system; that is, the algorithm runs in time polynomial in B, 1/ε, and n. The algorithm performs one pass over the target function, requires n^{1/p} space, and runs in n^{1+1/p} time; therefore, it is a streaming algorithm (requiring sublinear space) for all p > 1. For `∞, the algorithm runs in poly-logarithmic space and near-linear time. (For clarity here, we are suppressing terms based on log n, B, and ε; the exact statements appear in Theorem 3.8 and Theorem 3.10.)

(b) The result extends to fixed dimensions and to workloads, with increases in running time and space.

(c) For more general compactly-supported wavelets we show how our ideas yield a quasi-polynomial time approximation scheme (QPTAS), i.e., a running time of 2^{O(log^c n)} for a constant c (c = 1 gives polynomial time). This result is in contrast to the case of an arbitrary dictionary which, as we already mentioned, is hard to approximate to within any constant factor even allowing quasi-polynomial time (the hardness of approximation follows from the result of Feige [35]).

2. In terms of techniques, we use bounds involving the Haar scaling vectors {φi} along with the bounds using the wavelet vectors {ψi} that were introduced in Chapter 2 to tightly bound the components of the optimal solution. These bounds, in combination with a dynamic program, give us the approximation schemes. To the best of our knowledge, this is the first use of both the scaling and wavelet vectors to achieve such approximation guarantees.

3. For tree-structured dictionaries composed of the type of compactly-supported wavelets that we consider, our algorithms can be combined with the dynamic programming algorithm of Coiffman and Wickerhauser [17] to find a basis in the dictionary and a B-term representation of the given f in this basis. The `p error of the representation we construct provably approximates the error of a best representation of f using B terms from a basis in the dictionary.

3.3 A Streaming (1 + ε) Approximation for the Haar Wavelet

In this section we will provide a FPTAS for the Haar system. The algorithm will be bottom up, which is convenient from a streaming point of view. Observe that in the case of general `p norm error, we cannot rule out that the optimum solution has an irrational value, which is detrimental from a computational point of view. In a sense we will seek to narrow down our search space, but we will need to preserve near optimality. We will show that there exist sets Ri such that if the solution coefficients zi were drawn from Ri, then there exists one solution which is close to the optimum unrestricted solution (where we search over all reals). In a sense the sets Ri "rescue" us from the search. Alternately, we can view those sets as a "rounding" of the optimal solution. Obviously such sets exist if we do not care about the error; e.g., take the all-zero solution. We would expect a dependence between the sets Ri and the error bound we seek. We will use a type of "dual" wavelet bases; i.e., we use one basis to construct the coefficients and another to reconstruct the function. Our bases will differ by scaling factors. We will solve the problem in the scaled bases and translate the solution to the original basis. This overall approach is similar to that presented in [51]; however, it is different in several details critical to the proofs of running time, space complexity, and approximation guarantee.

Definition 3.1. Define ψ^a_{j,s} = 2^{−j/2} ψj,s and ψ^b_{j,s} = 2^{j/2} ψj,s. Likewise, define φ^a_{j,s} = 2^{−j/2} φj,s.

Proposition 3.1. The Cascade algorithm used with the filter (1/√2) h[] computes ⟨f, ψ^a_i⟩ and ⟨f, φ^a_i⟩.

We now use this change of basis. The next proposition follows from the definition of {ψ^b_i}.

Proposition 3.2. The problem of finding a representation f̂ with {zi} and basis {ψi} is equivalent to finding the same representation f̂ using the coefficients {yi} and the basis {ψ^b_i}. The correspondence is yi = yj,s = 2^{−j/2} zj,s.

In what follows, ρ will be our quantization parameter used in searching for a solution. We will choose ρ carefully, as its value will determine the running time of our algorithm and the approximation guarantee of the solution it computes.

Lemma 3.3. Let {yi*} be the optimal solution using the basis set {ψ^b_i} for the reconstruction, i.e., f̂ = Σi yi* ψ^b_i and ‖f − f̂‖_p = E. Let {yi^ρ} be the set where each yi* is rounded to the nearest multiple of ρ. If f^ρ = Σi yi^ρ ψ^b_i, then ‖f − f^ρ‖_p ≤ E + O(q n^{1/p} ρ log n).

Proof. Let ρi = yi* − yi^ρ. By the triangle inequality,

    ‖f − f^ρ‖_p ≤ E + ‖ Σi ρi ψ^b_i ‖_p.

Proposition 2.2 and the fact that |ρi| ≤ ρ imply |Σi ρi ψ^b_i(k)| ≤ c ρ q log n max_i |ψ^b_i(k)| for a small constant c. This bound gives ‖f − f^ρ‖_p ≤ E + O(q n^{1/p} ρ log n max_i ‖ψ^b_i‖_∞). Now ψ^b_i = ψ^b_{j,s} = 2^{j/2} ψj,s, and from the proof of Lemma 2.6 we know that for large j, ‖ψj,s‖_∞ is at most 2^{−j/2} times a constant. For smaller j, ‖ψ^b_{j,s}‖_∞ is a constant.

We will provide a dynamic programming formulation using the new basis. But we still need to show two results: the first concerning the $y^*_i$'s and the second concerning the $a_j[\cdot]$'s. The next lemma is very similar to Lemma 2.4 and follows from the fact that $\|\psi^a_{j,s}\|_1 = 2^{-j/2}\|\psi_{j,s}\|_1 \le \sqrt{2q}$.

Lemma 3.4. $-C_0\sqrt{q}\,E \le \langle f, \psi^a_i\rangle - y^*_i \le C_0\sqrt{q}\,E$ for some constant $C_0$.

Now suppose we know the optimal solution $\hat f$, and suppose we are computing the coefficients $a_j[\cdot]$ and $d_j[\cdot]$ for both f and $\hat f$ at each step j of the Cascade algorithm. We wish to know by how much their coefficients differ, since bounding this gap would shed more light on the solution $\hat f$.

Proposition 3.5. Let $a_j[s](F)$ be $a_j[s]$ computed from $a_0[s] = F(s)$; then $a_j[s](f) - a_j[s](\hat f) = a_j[s](f - \hat f)$.

Lemma 3.6. If $\|f - \hat f\|_p \le E$ then $|a_j[s](f - \hat f)| \le C_1\sqrt{q}\,E$ for some constant $C_1$. (We are using $\frac{1}{\sqrt 2}h[\cdot]$.)

Proof. The proof is similar to that of Lemma 2.4. Let $F = f - \hat f$. We know $-E \le F(i) \le E$. Multiplying by $|\phi^a_{j,s}(i)|$ and summing over all i we get $-E\|\phi^a_{j,s}\|_1 \le \langle F, \phi^a_{j,s}\rangle = a_j[s](F) \le E\|\phi^a_{j,s}\|_1$. By definition, $\phi^a_{j,s} = 2^{-j/2}\phi_{j,s}$. Further, $\|\phi_{j,s}\|_2 = 1$ and $\phi_{j,s}$ has at most $(2q)2^j$ non-zero values. Hence, $\|\phi^a_{j,s}\|_1 \le \sqrt{2q}$. The lemma follows.

At this point we have all the pieces. Summarizing:

Lemma 3.7. Let $\{z_i\}$ be a solution with B non-zero coefficients and with representation $\hat f = \sum_i z_i\psi_i$. If $\|f - \hat f\|_p \le E$, then there is a solution $\{y_i\}$ with B non-zero coefficients and representation $f' = \sum_i y_i\psi^b_i$ such that for all i we have:
(i) $y_i$ is a multiple of ρ;
(ii) $|y_i - \langle f, \psi^a_i\rangle| \le C_0\sqrt{q}\,E + \rho$; and,
(iii) $|\langle f, \phi^a_i\rangle - \langle f', \phi^a_i\rangle| \le C_1\sqrt{q}\,E + O(q^{3/2}\rho\log n)$,
and $\|f - f'\|_p \le E + O(qn^{1/p}\rho\log n)$.

Proof. Rewrite $\hat f = \sum_i z_i\psi_i = \sum_i z^*_i\psi^b_i$ where $z^*_i = z^*_{j,s} = 2^{-j/2}z_{j,s}$. Let $\{y_i\}$ be the solution where each $y_i$ equals $z^*_i$ rounded to the nearest multiple of ρ. Lemmas 3.4 and 3.6 bound the $z^*_i$'s, thus providing properties (ii) and (iii). Finally, Lemma 3.3 gives the approximation guarantee of $\{y_i\}$.

The above lemma ensures the existence of a solution {yi } that is O(qn1/p ρ log n) away from the optimal solution and that possesses some useful properties which we shall exploit for designing our algorithms. Each coefficient yi in this solution is a multiple of a parameter ρ that we are free to choose, and it is a constant multiple of E away from the ith wavelet coefficient of f . Further, without knowing the values of those coefficients yj,s contributing to the reconstruction of a certain point f 0 (i), we are guaranteed that during the incremental reconstruction of f 0 (i) using the cascade algorithm, every aj [s](f 0 ) in the support of f 0 (i) is a constant multiple of E away from aj [s](f ) = hf, φaj,s i. This last property allows us to design our algorithms in a bottom-up fashion making them suitable for data streams. Finally, since we may choose ρ, setting it appropriately results in true factor approximation algorithms. Details of our algorithms follow.

3.3.1 The Algorithm: A Simple Version

We will assume here that we know the optimal error E. This assumption can be circumvented by running O(log n) instances of the algorithm presented below 'in parallel', each with a different guess of the error. This will increase the time and space requirements of the algorithm by an O(log n) factor, which is accounted for in Theorem 3.8 (and also in Theorem 3.10). We detail the guessing procedure in Section 3.3.1. Our algorithm will be given E and the desired approximation parameter ε as inputs (see Figure 3.2). The Haar wavelet basis vectors naturally form a complete binary tree, termed the coefficient tree, since their support sets are nested and are of sizes that are powers of 2 (with one additional node as a parent of the tree). The data elements correspond to the leaves, and the coefficients correspond to the non-leaf nodes of the tree. Assigning a value y to the coefficient corresponds to assigning +y to all the leaves that are left descendants (descendants of the left child) and −y to all right descendants (recall the definition of $\{\psi^b_i\}$). The leaves that are descendants of a node in the coefficient

tree are termed the support of the coefficient.

Definition 3.2. Let E[i, v, b] be the minimum possible contribution to the overall error from all descendants of node i using exactly b coefficients, under the assumption that the ancestor coefficients of i will add up to the value v at i (taking account of the signs) in the final solution.

The value v will be set later for a subtree as more data arrive. Note that the definition is bottom up, and after we compute the table we do not need to remember the data items in the subtree. As the reader will have guessed, this second property will be significant for streaming. The overall answer is $\min_b E[\mathrm{root}, 0, b]$: by the time we are at the root, we have looked at all the data and no ancestors exist to set a non-zero v. A natural dynamic program arises whose idea is as follows. Let $i_L$ and $i_R$ be node i's left and right children respectively. In order to compute E[i, v, b], we guess the coefficient of node i and minimize over the error produced by $i_L$ and $i_R$ that results from our choice. Specifically, the computation is:

1. A non-root node computes E[i, v, b] as
$$\min\begin{cases}\min_{r,b'}\; E[i_L, v + r, b'] + E[i_R, v - r, b - b' - 1]\\ \min_{b'}\; E[i_L, v, b'] + E[i_R, v, b - b']\end{cases}$$
where the upper term computes the error if the ith coefficient is chosen and its value is $r \in R_i$, where $R_i$ is the set of multiples of ρ between $\langle f, \psi^a_i\rangle - C_0\sqrt{q}\,E$ and $\langle f, \psi^a_i\rangle + C_0\sqrt{q}\,E$; and the lower term computes the error if the ith coefficient is not chosen.

2. The root node then computes
$$\min\begin{cases}\min_{r,b'}\; E[i_C, r, b' - 1] & \text{(root coefficient is } r)\\ \min_{b'}\; E[i_C, 0, b'] & \text{(root not chosen)}\end{cases}$$
where $i_C$ is the root's only child.
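For concreteness, the following minimal sketch shows how a single table entry could be evaluated; it assumes in-memory dictionaries err_L and err_R for the children's tables, a precomputed candidate set R_i, and a helper round_to_grid, none of which are part of the formal description above (the streaming realization follows below).

```python
import math

def table_entry(err_L, err_R, R_i, v, b, round_to_grid):
    """Evaluate E[i, v, b] from the children's tables err_L and err_R.

    err_L / err_R : dict mapping (value, budget) -> minimum error of that child
    R_i           : candidate coefficient values for node i (multiples of rho)
    round_to_grid : rounds an incoming value onto the grid used by the child tables
    """
    best = math.inf
    # Case 1: node i stores a coefficient r, spending one of the b coefficients.
    for r in R_i:
        for b1 in range(b):  # coefficients given to the left child
            left = err_L.get((round_to_grid(v + r), b1), math.inf)
            right = err_R.get((round_to_grid(v - r), b - b1 - 1), math.inf)
            best = min(best, left + right)
    # Case 2: node i stores nothing; v passes through unchanged.
    for b1 in range(b + 1):
        left = err_L.get((round_to_grid(v), b1), math.inf)
        right = err_R.get((round_to_grid(v), b - b1), math.inf)
        best = min(best, left + right)
    return best
```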

The streaming algorithm will borrow from the paradigm of reduce-merge. The high-level idea will be to construct and maintain a small table of possibilities for each resolution of the data. On seeing each item f(i), we will first find out the best choices of the wavelets of length one (over all future inputs) and then, if appropriate, construct/update a table for wavelets of length 2, 4, etc. The idea of subdividing the data, computing some information, and merging results from adjacent divisions was used in [50] for stream clustering. The stream computation of wavelets in [42] can be viewed as a similar idea, where the divisions correspond to the supports of the wavelet basis vectors. Our streaming algorithm will compute the error arrays E[i, ·, ·] associated with the internal nodes of the coefficient tree in a post-order fashion. Recall that the wavelet basis vectors, which are described in Section 2.3, form a complete binary tree. For example, the scaled basis vectors for nodes 4, 3, 1 and 2 in the tree of Figure 3.1(a) are [1, 1, 1, 1], [1, 1, −1, −1], [1, −1, 0, 0] and [0, 0, 1, −1] respectively. The data elements correspond to the leaves of the tree and the coefficients of the synopsis correspond to its internal nodes. We need not store the error array for every internal node since, in order to compute E[i, v, b], our algorithm only requires that $E[i_L, \cdot, \cdot]$ and $E[i_R, \cdot, \cdot]$ be known. Therefore, it is natural to perform the computation of the error arrays in a post-order fashion. An example best illustrates the procedure. Suppose $f = \langle x_1, x_2, x_3, x_4\rangle$. In Figure 3.1(a), when element $x_1$ arrives, the algorithm computes the error array associated with $x_1$, call it $E_{x_1}$. When element $x_2$ arrives, $E_{x_2}$ is computed. The array E[1, ·, ·] is then computed, and $E_{x_1}$ and $E_{x_2}$ are discarded. Array $E_{x_3}$ is computed when $x_3$ arrives. Finally, the arrival of $x_4$ triggers the computations of the rest of the arrays as in Figure 3.1(b). Note that at any point in time, there is only one error array stored at each level of the tree. In fact, the computation of the error arrays resembles a binary counter. We start with an empty queue Q of error arrays. When $x_1$ arrives, $E_{q_0}$ is added to Q and the error associated with $x_1$ is stored in it.

(a) The arrival of the first 3 elements. (b) The arrival of x4.

Figure 3.1: An example illustrating the computation of the Haar dynamic programming error arrays. Upon seeing x2 node 1 computes E[1, ·, ·] and the two error arrays associated with x1 and x2 are discarded. Element x4 triggers the computation of E[2, ·, ·] and the two error arrays associated with x3 and x4 are discarded. Subsequently, E[3, ·, ·] is computed from E[1, ·, ·] and E[2, ·, ·] and both the latter arrays are discarded. If x4 is the last element on the stream, the root’s error array, E[3, ·, ·], is computed from E[2, ·, ·].

When $x_2$ arrives, a temporary node is created to store the error array associated with $x_2$. It is immediately used to compute an error array that is added to Q as $E_{q_1}$. Node $E_{q_0}$ is emptied, and it is filled again upon the arrival of $x_3$. When $x_4$ arrives: (1) a temporary $E_{t_1}$ is created to store the error associated with $x_4$; (2) $E_{t_1}$ and $E_{q_0}$ are used to create $E_{t_2}$; $E_{t_1}$ is discarded and $E_{q_0}$ is emptied; (3) $E_{t_2}$ and $E_{q_1}$ are used to create $E_{q_2}$, which in turn is added to the queue; $E_{t_2}$ is discarded and $E_{q_1}$ is emptied. The algorithm for ℓ_∞ is shown in Figure 3.2.
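The queue behaves exactly like a binary counter: each arriving leaf either fills the lowest empty slot or triggers a cascade of merges up the levels. The sketch below illustrates only this bookkeeping; the helpers make_leaf_table and merge are placeholders for the error-array computations of Figure 3.2 and are not part of the text's pseudocode.

```python
def stream_merge(stream, make_leaf_table, merge):
    """Binary-counter maintenance of one error array per level of the coefficient tree.

    make_leaf_table(x) builds the error array for a single data item;
    merge(left, right)  combines two level-i arrays into one level-(i+1) array.
    """
    queue = []                      # queue[i] holds the pending array at level i, or None
    for x in stream:
        carry = make_leaf_table(x)  # like adding 1 to a binary counter
        level = 0
        while level < len(queue) and queue[level] is not None:
            carry = merge(queue[level], carry)   # combine two filled subtrees
            queue[level] = None                  # that level is now empty
            level += 1
        if level == len(queue):
            queue.append(carry)
        else:
            queue[level] = carry
    return queue                    # at most one array per level remains
```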

Guessing the Optimal Error

We have so far assumed that we know the optimal error E. As mentioned at the beginning of Section 3.3.1, we will avoid this assumption by running multiple instances of our algorithm and supplying each instance a different guess $G_k$ of the error. We will also provide every instance $A_k$ of the algorithm with
$$\epsilon' = \frac{\sqrt{1 + 4\epsilon} - 1}{2}$$
as the approximation parameter. The reason for this will be apparent shortly. Our final answer will be that of the instance with the minimum representation error.

Theorem 3.8 shows that the running time and space requirements of our algorithm do not depend on the supplied error parameter. However, the algorithm's search ranges do depend on the given error. Hence, as long as $G_k \ge E$ the ranges searched

Algorithm HaarPTAS(B, E, ε)
1. Let ρ = εE/(cq log n) for some suitably large constant c. Note that q = 1 in the Haar case.
2. Initialize a queue Q with one node $q_0$. (* Each $q_i$ contains an array $E_{q_i}$ of size at most R·min{B, 2^i} and a flag isEmpty. *)
3. Repeat until there are no elements in the stream:
4.   Get the next element from the stream, call it e.
5.   If $q_0$ is empty,
6.     then set $q_0$.a = e. For all values r such that |r − e| ≤ c_1·E, where c_1 is a large enough constant and r is a multiple of ρ, initialize the table $E_{q_0}$[r, 0] = |r − e|.
7.     else create $t_1$ and initialize $E_{t_1}$[r, 0] = |r − e| as in Step 6.
8.   For i = 1 until the first empty $q_i$ or the end of Q:
9.     Create a temporary node $t_2$.
10.    Compute $t_2$.a = ⟨f, φ^a_i⟩ and the wavelet coefficient $t_2$.o = ⟨f, ψ^a_i⟩. This involves using the a values of $t_1$ and $q_{i−1}$ ($t_2$'s two children in the coefficient tree), taking their average to compute $t_2$.a and their difference divided by 2 to compute $t_2$.o. (Recall that we are using $\frac{1}{\sqrt 2}h[\cdot]$.)
11.    For all values r that are multiples of ρ with |r − $t_2$.a| ≤ c_1(E + ρ log n), compute the table $E_{t_2}$[r, b] for all 0 ≤ b ≤ B. This uses the tables of the two children $t_1$ and $q_{i−1}$. The size of the table is O(ε^{−1}Bn^{1/p} log n). (Note that the value of a chosen coefficient at node $t_2$ is at most a constant multiple of E away from $t_2$.o. Keeping track of the chosen coefficients (the answer) costs a factor O(B) more space.)
12.    Set $t_1$ ← $t_2$ and discard $t_2$.
13.    Set $q_i$.isEmpty = true.
14.  If we reached the end of Q,
15.    then create the node $q_i$.
16.    Compute $E_{q_i}$[r, b] for all 0 ≤ b ≤ B from $t_1$ and $q_{i−1}$ as in Step 11.
17.    Set $q_i$.isEmpty = false and discard $t_1$.

Figure 3.2: The Haar streaming FPTAS for ℓ_∞.


by the kth instance will include the ranges specified by Lemma 3.7. This lemma also tells us that if we search these ranges in multiples of ρ, then we will find a solution whose approximation guarantee is $E + cqn^{1/p}\rho\log n$. Our algorithm chooses ρ so that its running time does not depend on the supplied error parameter. Hence, given $G_k$ and ε′, algorithm $A_k$ sets $\rho = \epsilon' G_k/(cqn^{1/p}\log n)$. Consequently, its approximation guarantee is $E + \epsilon' G_k$. Now if guess $G_k$ is much larger than the optimal error E, then instance $A_k$ will not provide a good approximation of the optimal representation. However, if $G_k \le (1+\epsilon')E$, then $A_k$'s guarantee will be $E + \epsilon'(1+\epsilon')E = (1+\epsilon)E$ because of our choice of ε′. To summarize, in order to obtain the desired (1 + ε) approximation, we simply need to ensure that one of our guesses (call it $G_{k^*}$) satisfies $E \le G_{k^*} \le (1+\epsilon')E$. Setting $G_k = (1+\epsilon')^k$ implies that the above bounds will be satisfied when $k = k^* \in [\log_{1+\epsilon'}(E), \log_{1+\epsilon'}(E) + 1]$.

Number of guesses. Note that the optimal error E = 0 if and only if f has at most B non-zero expansion coefficients $\langle f, \psi_i\rangle$. We can find these coefficients easily in a streaming fashion. Since we assume that the entries in the given f are polynomially bounded, by the system of equations (2.6) we know that the optimum error is at least as much as the (B + 1)st largest coefficient. Now any coefficient $\langle f, \psi^a_k\rangle$ is the sum of the left half minus the sum of the right half of the $f_i$'s that are in the support of the basis vector, with the total divided by the length of the support. Thus if the smallest non-zero number in the input is $n^{-c}$ then the smallest non-zero wavelet coefficient is at least $n^{-(c+1)}$. By the same logic the largest non-zero coefficient is at most $n^c$. Hence, it suffices to make O(log n) guesses.
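As a small illustration of this guessing scheme, the sketch below enumerates geometric guesses $G_k = (1+\epsilon')^k$ between rough bounds implied by a polynomially bounded input and derives ε′ from ε; the function and its bounds are assumptions for illustration, not part of the algorithm's specification.

```python
import math

def error_guesses(n, eps, c=1):
    """Geometric guesses of the optimal error E for inputs bounded by n**c.

    Returns (eps_prime, guesses) where eps_prime satisfies
    eps_prime * (1 + eps_prime) = eps, so an (E + eps_prime*G_k) guarantee
    becomes (1 + eps)*E for the guess with E <= G_k <= (1 + eps_prime)*E.
    """
    eps_prime = (math.sqrt(1 + 4 * eps) - 1) / 2
    lo, hi = n ** -(c + 1), n ** c            # rough bounds on a non-zero optimal error
    k_min = math.floor(math.log(lo, 1 + eps_prime))
    k_max = math.ceil(math.log(hi, 1 + eps_prime))
    guesses = [(1 + eps_prime) ** k for k in range(k_min, k_max + 1)]
    return eps_prime, guesses                 # O(log n) guesses for constant eps
```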

3.3.2 Analysis of the Simple Algorithm

The size of the error table at node i, E[i, ·, ·], is $R_\phi\min\{B, 2^{t_i}\}$ where $R_\phi = 2C_1E/\rho + \log n$ and $t_i$ is the height of node i in the Haar coefficient tree (the leaves have height 0). Note that q = 1 in the Haar case. Computing each entry of E[i, ·, ·] takes $O(R_\psi\min\{B, 2^{t_i}\})$ time where $R_\psi = 2C_0E/\rho + 2$. Hence, letting $R = \max\{R_\phi, R_\psi\}$, we have that the total running time is $O(R^2B^2)$ for computing the root table plus $O\big(\sum_{i=1}^n(R\min\{2^{t_i}, B\})^2\big)$ for computing all the other error tables. Now,
$$\sum_{i=1}^{n}\big(R\min\{2^{t_i}, B\}\big)^2 = R^2\sum_{t=1}^{\log n}\frac{n}{2^t}\min\{2^{2t}, B^2\} = nR^2\left(\sum_{t=1}^{\log B}2^t + \sum_{t=\log B+1}^{\log n}\frac{B^2}{2^t}\right) = O(R^2 nB),$$
where the first equality follows from the fact that the number of nodes at level t is $n/2^t$.

For ℓ_∞, when computing E[i, v, b] we do not need to range over all values of b′. For a specific $r \in R_i$, we can find the value of b′ that minimizes $\max\{E[i_L, v + r, b'],\, E[i_R, v - r, b - b' - 1]\}$ using binary search. The running time thus becomes
$$R^2\sum_{t}\frac{n}{2^t}\min\{t2^t,\, B\log B\} = O(nR^2\log^2 B).$$

The bottom-up dynamic programming will require us to store the error tables along at most two leaf-to-root paths. Thus the required space is
$$2\sum_{t} R\min\{2^t, B\} = O\Big(RB\Big(1 + \log\frac{n}{B}\Big)\Big).$$

Since we set $\rho = \epsilon E/(cn^{1/p}\log n)$, we have $R = O((n^{1/p}\log n)/\epsilon)$.

Theorem 3.8. Algorithm HaarPTAS is an $O(\epsilon^{-1}B^2n^{1/p}\log^3 n)$-space algorithm that computes a (1 + ε) approximation to the best B-term unrestricted representation of a signal in the Haar system. Under the ℓ_p norm, the algorithm runs in time $O(\epsilon^{-2}n^{1+2/p}B\log^3 n)$. Under ℓ_∞ the running time becomes $O(\epsilon^{-2}n\log^2 B\log^3 n)$.

The extra B factor in the space required by the algorithm accounts for keeping track of the chosen coefficients.

3.3.3 An Improved Algorithm and Analysis

For large n (compared to B), we gain in running time if we change the rounding scheme given by Lemma 3.3. The granularity at which we search for the value of a coefficient will be fine if the coefficient lies toward the top of the tree, and coarse if the coefficient lies toward the bottom. The idea is that, for small ℓ_p norms, a mistake in a coefficient high in the tree affects everyone, whereas mistakes at the bottom are more localized. This idea utilizes the strong locality property of the Haar basis. We start with the lemma analogous to Lemma 3.3.

Lemma 3.9. Let $\{y^*_i\}$, $i = (t_i, s)$, be the optimal solution using the basis set $\{\psi^b_i\}$ for the reconstruction, i.e., $\hat f = \sum_i y^*_i\psi^b_i$ and $\|f - \hat f\|_p = E$. Here $t_i$ is the height of node i in the Haar coefficient tree. Let $\{y^\rho_i\}$ be the set where each $y^*_i$ is first rounded to the nearest multiple of $\rho_{t_i} = \epsilon E/(2B\,2^{t_i/p})$ and then the resulting value is rounded to the nearest multiple of $\rho_{t_{root}} = \epsilon E/(2Bn^{1/p})$. If $f^\rho = \sum_i y^\rho_i\psi^b_i$ then $\|f - f^\rho\|_p \le (1+\epsilon)E$.

Proof. As in Lemma 3.3, we need to estimate $\big\|\sum_i(y^\rho_i - y^*_i)\psi^b_i\big\|_p$ but using the new rounding scheme. Let S be the set of indices i such that $y_i \ne 0$. Then
$$\Big\|\sum_{i\in S}(y^\rho_i - y^*_i)\psi^b_i\Big\|_p \le \sum_{i\in S}\big\|(y^\rho_i - y^*_i)\psi^b_i\big\|_p \le \sum_{i\in S}(\rho_{t_i} + \rho_{t_{root}})\big\|\psi^b_i\big\|_p \le 2\sum_{i\in S}\rho_{t_i}\,2^{t_i/p}.$$
The last inequality follows from the fact that $2^{t_i}$ components of $\psi^b_i$ are equal to one and the rest are zero. The approximation hence follows from $|S| \le B$ and our choices of $\rho_{t_i}$.

The granularity of the dynamic programming tables E[i, ·, ·] is set according to the smallest $\rho_{t_i}$, which is $\rho_{t_{root}} = \epsilon E/(2Bn^{1/p})$. This allows their values to align correctly.

More specifically, when a coefficient is not chosen we compute (see Section 3.3.1)
$$E[i, v, b] = \min_{b'}\; E[i_L, v, b'] + E[i_R, v, b - b'].$$
A value v that is not outside the range of $E[i_L, \cdot, \cdot]$ and $E[i_R, \cdot, \cdot]$ will be a correct index into these two arrays. We gain from this rounding scheme, however, when we are searching for a value to assign to node i. If i is chosen, we can search for its value in the range $\langle f, \psi^a_i\rangle \pm 2C_0E$ in multiples of $\rho_{t_i}$. Hence, as mentioned earlier, the granularity of our search will be fine for nodes at top levels and coarse for nodes at lower levels. More formally, if i is chosen, we compute
$$E[i, v, b] = \min_{r,b'}\; E[i_L, v + r, b'] + E[i_R, v - r, b - b' - 1],$$
where we search for the best r in multiples of $\rho_{t_i}$. The value v + r (resp. v − r) may not index correctly into $E[i_L, \cdot, \cdot]$ (resp. $E[i_R, \cdot, \cdot]$) since $\rho_{t_i} = 2^{d/p}\rho_{t_{root}}$ where $d = t_{root} - t_i$. Hence, we need to round each value of r we wish to check to the nearest multiple of $\rho_{t_{root}}$. This extra rounding is accounted for in Lemma 3.9. Letting R be the number of values each table holds and $R_{t_i} = 2C_0E/\rho_{t_i} + 2$ be the number of entries we search at node i, and using an analysis similar to that of Section 3.3.2, the running time (ignoring constant factors) becomes
$$O\Big(\sum_{i=1}^{n} R\,R_{t_i}\min\{2^{2t_i}, B^2\}\Big) = O\Big(R\sum_{t=1}^{\log n}\frac{n}{2^t}\cdot\frac{B2^{t/p}}{\epsilon}\min\{2^{2t}, B^2\}\Big) = O\left(\frac{nRB}{\epsilon}\Big(\sum_{t=1}^{\log B}2^{t/p+t} + B^2\!\sum_{t=\log B+1}^{\log n}\!2^{t/p-t}\Big)\right) = O\Big(\frac{nRB^{2+1/p}}{\epsilon}\Big).$$
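A small sketch of the level-dependent grids, assuming the height t of a node and the parameters (ε, E, B, p) are known; snapping a searched value back onto the root grid mirrors the extra rounding accounted for in Lemma 3.9. The helper names are illustrative only.

```python
def rho(t, eps, E, B, p):
    """Grid step used when searching a coefficient at height t."""
    return eps * E / (2 * B * 2 ** (t / p))

def snap_to_root_grid(value, eps, E, B, p, t_root):
    """Round a candidate value onto the finest (root-level) grid so that it
    indexes correctly into the children's tables."""
    step = rho(t_root, eps, E, B, p)
    return round(value / step) * step
```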

Hence, since $R = O(n^{1/p}B/\epsilon)$ based on the granularity $\rho_{t_{root}}$, the running time for each instance of the algorithm is $O((nB)^{1+1/p}B^2/\epsilon^2)$. The space requirement is the same as that of the simpler algorithm; namely, O(RB log n).

Theorem 3.10. The algorithm described above (with the new rounding scheme) uses $O(\epsilon^{-1}B^3n^{1/p}\log^2 n)$ space and computes a (1 + ε) approximation to the best B-term unrestricted representation of a signal in the Haar system under the ℓ_p norm. It runs in time $O(\epsilon^{-2}(nB)^{1+1/p}B^2\log n)$.






Figure 3.3: The support regions and coefficient tree structure of the 2-dimensional Haar basis vectors for a 4 × 4 data array. Each node other than the root contains the 3 wavelet coefficients having the same support region.

Again, and as in Theorem 3.8, the extra B factor in the space requirement accounts for keeping track of the chosen coefficients, and the extra log n factor in both the space and time requirements accounts for the guessing of the error. We choose the better of the two algorithms (or rounding schemes), whose approximation and time and space requirements are guaranteed by Theorems 3.8 and 3.10.

3.4 Extensions

3.4.1 PTAS for multi-dimensional Haar Systems

Our algorithm and analysis from Section 3.3 extend to multi-dimensional Haar wavelets when the dimension D is a given constant. For D ≥ 2 define $2^D - 1$ mother wavelets (see also [38, 39]). For all integers $0 \le d < 2^D$ let
$$\psi^d(x) = \theta^{d_1}(x_1)\theta^{d_2}(x_2)\cdots\theta^{d_D}(x_D),$$
where $d_1d_2\ldots d_D$ is the binary representation of d and $\theta^0 = \phi$, $\theta^1 = \psi$. For d = 0 we obtain the D-dimensional scaling function $\psi^0(x) = \phi(x_1)\phi(x_2)\cdots\phi(x_D)$. At scale $2^j$ and for $s = (s_1, s_2, \ldots, s_D)$ define
$$\psi^d_{j,s}(x) = 2^{-Dj/2}\,\psi^d\Big(\frac{x_1 - 2^js_1}{2^j}, \cdots, \frac{x_D - 2^js_D}{2^j}\Big).$$
The family $\{\psi^d_{j,s}\}_{1\le d<2^D,\,(j,s)\in\mathbb{Z}^{D+1}}$ is an orthonormal basis of $L_2(\mathbb{R}^D)$ [64, Thm. 7.25]. Note that in multi-dimensions we define $\psi^{a,d}_{j,s} = 2^{-Dj/2}\psi^d_{j,s}$, $\psi^{b,d}_{j,s} = 2^{Dj/2}\psi^d_{j,s}$ and $\phi^{a,d}_{j,s} = 2^{-Dj/2}\phi^d_{j,s}$, which is analogous to Definition 3.1. Thus $\|\psi^{a,d}_{j,s}\|_1 = \|\phi^{a,d}_{j,s}\|_1 = 1$ since $\|\psi^d_{j,s}\|_1 = 2^{Dj}\,2^{-Dj/2} = 2^{Dj/2}$. Also $\|\psi^{b,d}_{j,s}\|_\infty = 1$.

As an illustration of a multi-dimensional basis, Figure 3.3 [38, Figures 1, 2] shows the support regions and signs of the 16 two-dimensional Haar basis vectors corresponding to a 4 × 4 input array. The figure shows that the support region of, say, $\psi^3_{1,(0,0)}$, which corresponds to entry (2, 2) in the array, is the $2^{jD}$ (j = 1, D = 2) elements comprising the lower left quadrant of the input array. Figure 3.3 also shows the coefficient tree for this case. In general, each node in the tree has $2^D$ children and corresponds to $2^D - 1$ coefficients (assuming the input is a hypercube). The structure of the coefficient tree will result in an $O(R^{2^D-1})$ increase in running time over the one-dimensional case, where $R = O(\epsilon^{-1}n^{1/p}\log n)$.

As in Section 3.3.1, we associate an error array E[i, b, v] with each node i in the tree, where v is the result of the choices of i's ancestors and b ≤ B is the number of coefficients used by the subtree rooted at i. The size of each table is thus $O(\min\{2^{Dj}, B\}R)$ where j is the level of the tree to which i belongs. When computing an entry E[i, b, v] in the table, we need to choose the best non-zero subset S of the $2^D - 1$ coefficients that belong to the node and the best assignment of values to these |S| coefficients. These choices contribute a factor $O((R+1)^{2^D-1})$ to the time complexity. We also have to choose the best partition of the remaining b − |S| coefficients into $2^D$ parts, adding another $O((\min\{2^{Dj}, B\})^{2^D-1})$ factor to the running time.

We can avoid the latter factor by ordering the search among the node's children as in [38, 39]; a small sketch of this ordering follows this paragraph. Each node is broken into $2^D - 1$ subnodes: suppose node i has children $c_1, \ldots, c_{2^D}$, ordered in some manner. Then subnode $i_t$ will have $c_t$ as its left child and subnode $i_{t+1}$ as its right child, and subnode $i_{2^D-1}$ will have $c_{2^D-1}$ and $c_{2^D}$ as its children. Now all subnode $i_t$ needs to do is search for the best partition of b into 2 parts as usual. Specifically, fix S and the values given to the coefficients in S. For each b′, $0 \le b' \le \min\{2^{Dj}, b - |S|\}$, each subnode starting from $i_{2^D-1}$ up to $i_2$ tabulates the best allotment of b′ coefficients to its two children. Subnode $i_1$ then computes the best partition of b − |S| coefficients into two parts. This process takes $O(\min\{2^{Dj}, B\} + (2^D-2)(\min\{2^{Dj}, B\})^2) = O(2^D(\min\{2^{Dj}, B\})^2)$ time per node. For ℓ_∞ the bounds are better because we can perform binary search on the partition of b′ − |S| into two parts. All the error arrays for the subnodes are discarded before considering the next choice of S and values assigned to its elements. Hence, assuming the input is of size N, and since there are $N/2^{Dj}$ nodes per level of the coefficient tree, the total running time is
$$O\left(\sum_{j=1}^{\log N}\frac{N}{2^{Dj}}\,(R+1)^{2^D-1}\,2^D\,R\,\big(\min\{2^{Dj}, B\}\big)^3\right) = O\big(NB^2R^{2^D}\big),$$
where we dropped the constant factors involving D in the final expression and used a calculation similar to that in Section 3.3.2. Notice that dropping the factor $2^D - 2$ from the computations of the subnodes results in a running time that does not reduce to that of the one-dimensional case (which was $O(nBR^2)$); however, we chose to drop it above for clarity. Finally, recall from Section 3.3.1 that we need to make O(log N) guesses of the error E.
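As a small illustration of the ordering trick, the sketch below chains the $2^D$ children of a node into a sequence of binary subnodes, so that each subnode only splits a coefficient budget into two parts; the helper combine_two is an assumed black box for the two-way budget search.

```python
def chain_subnodes(child_tables, combine_two):
    """Combine the 2^D child tables of a node through a chain of binary subnodes.

    child_tables : list of per-child error tables, ordered c_1, ..., c_{2^D}
    combine_two  : combines two tables by searching the best 2-way budget split
    """
    # The last subnode merges the final two children; each earlier subnode
    # merges child c_t with the table produced by the subnode after it.
    acc = combine_two(child_tables[-2], child_tables[-1])
    for table in reversed(child_tables[:-2]):
        acc = combine_two(table, acc)
    return acc   # table for the whole node, for one fixed choice of S and its values
```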

3.4.2 QPTAS for General Compact Systems

We show a simple dynamic programming algorithm that finds a (1 + ε)-approximation to the wavelet synopsis construction problem under the ℓ_∞ norm. The algorithm uses $g(q, n) = n^{O(q(\log q + \log\log n))}$ time and space. Under the ℓ_p norm, the algorithm uses $n^{O(q(\log q + \frac{\log n}{p}))}$ time and space. We will describe the algorithm for the Daubechies D2 wavelet under the ℓ_∞ norm. Recall that the Daubechies filters have 2q non-zero coefficients (cf. Example II and Figure 2.3 in Section 2.3).

For a given subproblem, call an edge an interface edge if exactly one of its endpoints is in the subproblem. Each interface edge has a value associated with it which is eventually determined at a later stage. We will maintain that each subproblem has at most 4q log n interface edges. A subproblem has a table E associated with it where, for each b ≤ B and each configuration I of values on interface edges, E[b, I] stores the minimum contribution to the overall error when the subproblem uses b coefficients and the interface configuration is I. From Lemma 3.7, setting $\rho = \epsilon E/(c_1q\log n)$ for some suitably large constant $c_1$, each interface edge can have one of $V = O(\frac{q^{3/2}\log n}{\epsilon})$ values under the ℓ_∞ norm. Hence, the size of E is bounded by $BV^{4q\log n} = g(q, n)$.

The algorithm starts with an initialization phase that creates the first subproblem. This phase essentially flattens the cone shape of the coefficient graph, and the only difference between it and later steps is that it results in one subproblem as opposed to two. We select any 2q consecutive leaves in the coefficient graph and their ancestors. This is at most 2q log n nodes. We will guess the coefficients of the optimal solution associated with this set of nodes. Again, from Lemma 3.7, each coefficient can take one of $W = O(\frac{q^{3/2}\log n}{\epsilon})$ values under the ℓ_∞ norm. For each of the $(2W)^{2q\log n} = g(q, n)$ guesses, we will run the second phase of the algorithm.

In the second phase, given a subproblem A, we first select the 2q 'middle' leaves and their ancestors. Call this strip of nodes S. Note that |S| ≤ 2q log n. The nodes in S break A into two smaller subproblems L and R (see Figure 3.4). Suppose we have $E_L$ and $E_R$, the two error arrays associated with L and R respectively. We compute each entry $E_A[b, I]$ as follows. First, we guess the b′ non-zero coefficients of the optimal solution associated with the nodes in S and their values. Combined with the configuration I, these values define a configuration $I_L$ (resp. $I_R$) for the interface edges of L (resp. R) in the obvious way. Furthermore, they result in an error e associated with the leaf nodes in S. Hence,
$$E[b, I] = e + \min_{b''}\max\big\{E_L[b'', I_L],\, E_R[b - b' - b'', I_R]\big\}.$$
Therefore, computing each entry in E takes at most $B(2W)^{2q\log n} = g(q, n)$ time.


Figure 3.4: An example D2 subproblem. The shaded nodes belong to the strip S. The edges crossing the ‘frontier’ are interface edges.

The running time of the algorithm follows.

Theorem 3.11. We can compute a (1 + ε) approximation to the best B-term unrestricted representation of a compact system under the ℓ_∞ norm in $n^{O(q(\log q + \log\log n))}$ time.

The result also extends to ℓ_p norms, but remains a quasi-polynomial time algorithm. The main point of the above theorem is that the representation problem is not Max-SNP-Hard.

3.4.3 Workloads

The algorithm and analysis from Section 3.3 also extend to weighted cases/workloads under the same assumptions as in [66]. Namely, given f and $\{w_i\}$ where $\sum_{i=1}^n w_i = 1$ and $0 < w_i \le 1$, we wish to find a solution $\{z_i\}$ with at most B non-zero coefficients that minimizes
$$\Big\|f - \sum_i z_i\psi_i\Big\|_{p,\vec w} = \Big(\sum_j w_j^p\,\Big|f(j) - \sum_i z_i\psi_i(j)\Big|^p\Big)^{1/p}.$$

Letting $w = \min_i w_i$ and $W = \max_i w_i$, we will show how our approximation algorithm extends to this case with a factor $W/w$ increase in its space requirement and a factor $(W/w)^2$ increase in running time.

The following three lemmas are analogs of Lemmas 3.3, 3.6 and 3.4 respectively. The first two are straightforward, but note the factor W in the additive approximation.

Lemma 3.12. Let $\{y^*_i\}$ be the optimal solution using the basis set $\{\psi^b_i\}$ for the reconstruction, i.e., $\hat f = \sum_i y^*_i\psi^b_i$ and $\|f - \hat f\|_{p,\vec w} = E$. Let $\{y^\rho_i\}$ be the set where each $y^*_i$ is rounded to the nearest multiple of ρ. If $f^\rho = \sum_i y^\rho_i\psi^b_i$ then $\|f - f^\rho\|_{p,\vec w} \le E + O(qWn^{1/p}\rho\log n)$.

Lemma 3.13. $-C_1\frac{\sqrt q}{w}E \le \langle f, \phi^a_{j,s}\rangle - \langle\hat f, \phi^a_{j,s}\rangle \le C_1\frac{\sqrt q}{w}E$ for some constant $C_1$.

Lemma 3.14. $-C_0\frac{\sqrt q}{w}E \le \langle f, \psi^a_{j,s}\rangle - y^*_i \le C_0\frac{\sqrt q}{w}E$ for some constant $C_0$.

Proof. For all j we have $w_j\big|f(j) - \sum_i y^*_i\psi^b_i(j)\big| \le E$. Multiplying by $|\psi^a_k(j)|$ and summing over all j we get
$$\sum_j w_j\Big|f(j)\psi^a_k(j) - \sum_i y^*_i\psi^b_i(j)\psi^a_k(j)\Big| \le \|\psi^a_k\|_1E$$
$$\Rightarrow\quad w\Big|\sum_j f(j)\psi^a_k(j) - \sum_i\sum_j y^*_i\psi^b_i(j)\psi^a_k(j)\Big| \le \|\psi^a_k\|_1E$$
$$\Rightarrow\quad w\,\big|\langle f, \psi^a_k\rangle - y^*_k\big| \le \sqrt{2q}\,E,$$
completing the proof.

Hence, setting $\rho = \epsilon E/(cqWn^{1/p}\log n)$ for some suitably large constant c, we get the desired approximation with R from the analysis above equal to $O(E/(w\rho)) = O(\frac{W}{w}\,q\,\epsilon^{-1}n^{1/p}\log n)$.

3.4.4 Quality Versus Time

A natural question arises: if we were interested in the restricted synopses only, could we develop streaming algorithms for them? The answer reveals a rich tradeoff between synopsis quality and running time.

Observe that if at each node we only consider either storing the coefficient or 0, then we can limit the search significantly. Instead of searching over all v + r to the left and v − r to the right in the dynamic program (which we repeat below)
$$\min\begin{cases}\min_{r,b'}\; E[i_L, v + r, b'] + E[i_R, v - r, b - b' - 1]\\ \min_{b'}\; E[i_L, v, b'] + E[i_R, v, b - b']\end{cases}$$
we only need to search for $r = \langle f, \psi^a_i\rangle$: observe that a streaming algorithm can compute $\langle f, \psi^a_i\rangle$ (see [42]). However, we have to "round" $\langle f, \psi^a_i\rangle$ to a multiple of ρ since we are storing the table corresponding to the multiples of ρ between $\langle f, \phi^a_i\rangle - C_1\sqrt{q}\,E$ and $\langle f, \phi^a_i\rangle + C_1\sqrt{q}\,E$. We consider the better of rounding up or rounding down $\langle f, \psi^a_i\rangle$ to the nearest multiple of ρ. The running time improves by a factor of R in this case since, in order to compute each entry, we are now considering only two values of $R_i$ (round up/down) instead of the entire set. The overall running time is O(RnB) in the general case and $O(Rn\log^2 B)$ for the ℓ_∞ variants. The space bound and the approximation guarantees remain unchanged. However, the guarantee is now against the synopsis which is restricted to storing wavelet coefficients.

The above discussion sets the ground for investigating a variety of hybrid algorithms where we choose different search strategies for each coefficient. We introduced this idea in [51], but in the context of a weaker approximation strategy. One strategy we explore in Section 3.6 is to allow the root node to range over the set $R_1$ while considering the better of rounding up or rounding down $\langle f, \psi^a_i\rangle$ to the nearest multiple of ρ for all other coefficients (i > 1). We show that this simple modification improves on the quality of the restricted synopsis and on the running time of the unrestricted algorithm.
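A minimal sketch of the restricted per-node search, assuming the streamed wavelet coefficient coef = ⟨f, ψ^a_i⟩ is available: instead of scanning the whole candidate set $R_i$, only the two grid values bracketing coef are tried.

```python
import math

def restricted_candidates(coef, rho):
    """Candidates for a node in the restricted (round up / round down) search."""
    down = math.floor(coef / rho) * rho
    up = math.ceil(coef / rho) * rho
    return {down, up}          # at most two values instead of the full set R_i
```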

3.5 Best Basis Selection from a Dictionary

In this section we show how our algorithms can be extended to find representations in certain types of tree-structured dictionaries. Specifically, the dictionaries we consider are full binary tree-structured dictionaries composed of compactly supported wavelets. Given $f \in \mathbb{R}^n$, $B \in \mathbb{Z}$ and such a dictionary D, we now wish to find the best B-term representation of f in a basis from D. Notice that we seek both the best basis in D for representing f using B terms and the best B-term representation of f in this basis. The error of the representation is its ℓ_p distance from f. We show in Theorem 3.17 how our algorithms from the previous sections can be used to find provably approximate answers to this bi-criteria optimization problem.

We start with the description of our tree-structured dictionaries. Similar to Coifman and Wickerhauser [17], our dictionaries will be composed of O(n log n) vectors, and will contain $2^{O(n/2)}$ bases: equal to the number of cuts in a complete binary tree. Let $a_{(j,p)} = 2^jp$ and let $g_{(j,p)}[t] = 1_{[a_{(j,p)},\,a_{(j,p+1)}-1]}[t]$ be the discrete dyadic window that is 1 in $[a_{(j,p)}, a_{(j,p+1)} - 1]$ and zero elsewhere. Each node in D is labeled by (j, p), $0 \le j \le \log n$, $0 \le p \le n2^{-j} - 1$, where j is the height of the node in the tree (the root is at height log n), and p is the number of nodes to its left that are at the same height in a complete binary tree. With each node (j, p) we associate the subspace $W_{(j,p)}$ of $\mathbb{R}^n$ that exactly includes all functions $f \in \mathbb{R}^n$ whose support lies in $g_{(j,p)}$. Clearly, $W_{(\log n,0)} = \mathbb{R}^n$. Now suppose that for each l > 0, $\{e_{k,l}\}_{0\le k<l}$ is an orthonormal basis for $\mathbb{R}^l$. Then
$$\mathcal{B}_{(j,p)} = \big\{\psi_{k,(j,p)}[t] = g_{(j,p)}[t]\,e_{k,2^j}[t - a_{(j,p)}]\big\}_{0\le k<2^j}$$
is an orthonormal basis for $W_{(j,p)}$.

Proposition 3.15. For any internal node (j, p) in the dictionary D, $W_{(j-1,2p)}$ and $W_{(j-1,2p+1)}$ are orthogonal, and $W_{(j,p)} = W_{(j-1,2p)} \oplus W_{(j-1,2p+1)}$.

Proof. The two spaces $W_{(j-1,2p)}$ and $W_{(j-1,2p+1)}$ are orthogonal since for all k, the inner product $\langle\psi_{k,(j-1,2p)}, \psi_{k,(j-1,2p+1)}\rangle = 0$ (as their supports do not overlap). Now any $f \in W_{(j,p)}$ can be written as a combination of two blocks: $f[t] = g_{(j-1,2p)}[t]f[t] + g_{(j-1,2p+1)}[t]f[t]$; and the first block (resp. second block) can be represented using $\{\psi_{k,(j-1,2p)}\}_k$ (resp. $\{\psi_{k,(j-1,2p+1)}\}_k$). We can thus construct an orthonormal basis of $W_{(j,p)}$ via a union of orthonormal bases of $W_{(j-1,2p)}$ and $W_{(j-1,2p+1)}$.

Corollary 3.16. Let $\{(j_i, p_i)\}$ be the set of nodes corresponding to a cut in the dictionary tree. We have
$$\bigoplus_i W_{(j_i,p_i)} = \mathbb{R}^n.$$

Hence, there are $O(2^{n/2})$ bases in our dictionary. The main result of this section follows. We prove it under the ℓ_∞ error measure. The argument is extended to general ℓ_p error measures in a straightforward manner.

Theorem 3.17. If A is a (streaming) algorithm that achieves a C-approximation for the B-term representation problem under ℓ_∞ (for any wavelet included in the dictionary D), then A is a (streaming) C-approximation for the bi-criteria representation problem.

Proof. Let $E_{(j,p)}[b]$ be the minimum contribution to the overall error (as computed by A) from representing the block $g_{(j,p)}[t]f[t]$ using b vectors from a basis of $W_{(j,p)}$. Call the basis that achieves this error the best basis for $W_{(j,p)}$ and denote it by $O_{(j,p)}[b]$. By Proposition 3.15 there are $O(2^{2^{j-1}})$ possible bases for the space $W_{(j,p)}$ in D. Now if (j, p) is a leaf node, then $E_{(j,p)}[b] = A(g_{(j,p)}f, \mathcal{B}_{(j,p)}, b)$, which is the error resulting from representing the block $g_{(j,p)}[t]f[t]$ using b vectors from the basis $\mathcal{B}_{(j,p)}$. Otherwise, if (j, p) is an internal node, $E_{(j,p)}[b]$ equals
$$\min\begin{cases}\min_{0\le b'\le b}\max\big\{E_{(j-1,2p)}[b'],\, E_{(j-1,2p+1)}[b - b']\big\},\\ A(g_{(j,p)}f, \mathcal{B}_{(j,p)}, b)\end{cases}$$
and
$$O_{(j,p)}[b] = \begin{cases}\mathcal{B}_{(j,p)} & \text{if } E_{(j,p)}[b] = A(g_{(j,p)}f, \mathcal{B}_{(j,p)}, b)\\ O_{(j-1,2p)}[b_{(j,p)}]\cup O_{(j-1,2p+1)}[b - b_{(j,p)}] & \text{else,}\end{cases}$$
where $b_{(j,p)}$ is the argument that minimizes the top expression in $E_{(j,p)}[b]$.

Suppose opt chooses the cut $\{(j_o, p_o)\}$ with the corresponding partition $\{b_o\}$ of B, and we choose the cut $\{(j_i, p_i)\}$ with partition $\{b_i\}$. By the dynamic program above we have
$$\max_i A(g_{(j_i,p_i)}f, \mathcal{B}_{(j_i,p_i)}, b_i) = \max_i E_{(j_i,p_i)}[b_i] \quad (3.1a)$$
$$\le \max_o E_{(j_o,p_o)}[b_o] \quad (3.1b)$$
$$\le \max_o A(g_{(j_o,p_o)}f, \mathcal{B}_{(j_o,p_o)}, b_o) \quad (3.1c)$$
$$\le C\max_o \mathrm{opt}(g_{(j_o,p_o)}f, \mathcal{B}_{(j_o,p_o)}, b_o) \quad (3.1d)$$
$$= C\,\mathrm{opt}, \quad (3.1e)$$

where (3.1b) follows from the fact that our dynamic program chooses the best cut and corresponding partition of B among all possible cuts and partitions based on the errors computed by algorithm A; (3.1c) follows from the definition of our dynamic programming table entries E(j,p) [b]; (3.1d) follows from the assumption that A is a Capproximation algorithm; and (3.1e) follows from the optimal substructure property of our problem.
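The following sketch outlines the best-basis recursion over the dictionary tree under ℓ_∞, assuming a black-box approx_error(j, p, b) that runs algorithm A on the block g_(j,p)·f with budget b; it illustrates the recursion only and is not the streaming implementation.

```python
from functools import lru_cache

def best_basis_error(log_n, B, approx_error):
    """min_b E[(root)][b]: best error over all bases in the dictionary and budgets b <= B."""

    @lru_cache(maxsize=None)
    def E(j, p, b):
        own = approx_error(j, p, b)            # stay with this node's basis B_{(j,p)}
        if j == 0:                             # leaf node: nothing to split
            return own
        split = min(                           # or split the budget between the two children
            max(E(j - 1, 2 * p, b1), E(j - 1, 2 * p + 1, b - b1))
            for b1 in range(b + 1)
        )
        return min(own, split)

    return min(E(log_n, 0, b) for b in range(B + 1))
```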

3.6 Comparing the Restricted and Unrestricted Optimizations

We consider two issues in this section, namely (i) the quality of the unrestricted version vis-a-vis the restricted optimum solution, and (ii) the running times of the algorithms. We will restrict our experiments to the ℓ_∞ norm.

3.6.1 The algorithms

All experiments reported in this section were performed on a 2-CPU Pentium-III 1.4 GHz machine with 2GB of main memory, running Linux. All algorithms were implemented using version 3.3.4 of the gcc compiler. We show the performance figures of the following schemes:

REST: This characterizes the algorithms for the restricted version of the problem. It is the O(n^2)-time, O(n)-space algorithm in [46] (see also [38, 65, 71]).

UNREST: This is the streaming algorithm for the full general version described in Algorithm HaarPTAS, based on the discussion in Section 3.3.

HYBRID: This is the streaming hybrid algorithm proposed in Section 3.4.4.

Note that the UNREST and HYBRID algorithms are not the additive approximation algorithms in [51] (although we kept the same names).

3.6.2 The Data Sets

We chose a synthetic dataset to showcase the point made in the introduction about the sub-optimality of the restricted versions. Otherwise we use a publicly available real-life data set for our experiments.

• Saw: This is a periodic dataset with a line repeated 8 times, with 2048 values total. The dataset is shown in Figure 3.5.

• DJIA data set: We used the Dow-Jones Industrial Average (DJIA) data set available at StatLib, which contains DJIA closing values from 1900 to 1993. There were a few negative values (e.g., −9), which we removed. We focused on prefixes of the dataset of sizes up to 16384. The dataset is shown in Figure 3.6.

The implementation is available from http://www.cis.upenn.edu/∼boulos. The DJIA data set at StatLib is available from http://lib.stat.cmu.edu/datasets/djdc0093.



Figure 3.5: The Saw dataset.


Figure 3.6: The DJIA dataset.


Figure 3.7: The `∞ error of the three algorithms, UNREST, REST, and HYBRID for the Saw dataset (n = 2048).

3.6.3 Quality of Synopsis

The ℓ_∞ errors as a function of B are shown in Figures 3.7 and 3.8. The ε in the approximation algorithms UNREST and HYBRID was set to 1. All the algorithms gave very similar synopses for the Saw data and had almost the same errors. In the case of the Dow data we show the range B = 5 onward, since the maximum value is ∼500 and the large errors for B < 5 (for all algorithms) bias the scale, making the differences in the more interesting ranges not visible. The algorithm REST has more than 20% worse error compared to UNREST, or requires over 35% more coefficients to achieve the same error (for most error values). The HYBRID algorithm performs consistently in the middle.

3.6.4 Running Times

Figure 3.9 shows the running times of the algorithms as the prefix size n is varied for the Dow data. As mentioned above, ε was set to 1. The grid in the log-log plot helps us clearly identify the quadratic nature of REST.


Figure 3.8: The `∞ error of the three algorithms, UNREST, REST, and HYBRID for the Dow data set (n = 16384).


Figure 3.9: Running times of the three algorithms, UNREST, REST, and HYBRID for prefixes of the Dow data set.


The algorithms UNREST and HYBRID behave linearly, as is expected from streaming algorithms. Given its speed and quality, the HYBRID algorithm seems to be the best choice from a practical perspective.


Chapter 4

Sampling Algorithms and Coresets for ℓ_p Regression

4.1 Background

An important question in algorithmic problem solving is whether there exists a small subset of the input such that if computations are performed only on this subset, then the solution to the given problem can be approximated well. Such a subset is often known as a coreset for the problem. The concept of coresets has been extensively used in solving many problems in optimization and computational geometry; e.g., see the excellent survey by Agarwal, Har-Peled, and Varadarajan [2]. In this chapter, we construct coresets and obtain efficient sampling algorithms for the classical ℓ_p regression problem, for all p ∈ [1, ∞). Recall the ℓ_p regression problem:

Problem 4.1 (ℓ_p regression problem). Let $\|\cdot\|_p$ denote the p-norm of a vector. Given as input a matrix $A \in \mathbb{R}^{n\times m}$, a target vector $b \in \mathbb{R}^n$, and a real number p ∈ [1, ∞), find a vector $x_{opt}$ and a number Z such that
$$Z = \min_{x\in\mathbb{R}^m}\|Ax - b\|_p = \|Ax_{opt} - b\|_p. \qquad (4.1)$$

We will use the following ℓ_p regression coreset concept:

Definition 4.1 (ℓ_p regression coreset). Let 0 < ε < 1. A coreset for Problem 4.1 is a set of indices I such that the solution $\hat x_{opt}$ to $\min_{x\in\mathbb{R}^m}\|\hat Ax - \hat b\|_p$, where $\hat A$ is composed of those rows of A whose indices are in I and $\hat b$ consists of the corresponding elements of b, satisfies $\|A\hat x_{opt} - b\|_p \le (1+\epsilon)\min_x\|Ax - b\|_p$.

If n ≫ m, i.e., if there are many more constraints than variables, then (4.1) is an overconstrained ℓ_p regression problem. In this case, there does not in general exist a vector x such that Ax = b, and thus Z > 0. Overconstrained regression problems are fundamental in statistical data analysis and have numerous applications in applied mathematics, data mining, and machine learning [13, 55]. Even though convex programming methods can be used to solve the overconstrained regression problem in time $O((mn)^c)$, for c > 1, this is prohibitive if n is large.1 This raises the natural question of developing more efficient algorithms that run in time $O(m^cn)$, for c > 1, while possibly relaxing the solution to Equation (4.1). In particular: Can we get a κ-approximation to the ℓ_p regression problem, i.e., a vector $\hat x$ such that $\|A\hat x - b\|_p \le \kappa Z$, where κ > 1? Note that a coreset of small size would strongly satisfy our requirements and result in an efficiently computed solution that's almost as good as the optimal. Thus, the question becomes: Do coresets exist for the ℓ_p regression problem, and if so can we compute them efficiently?

Our main result is an efficient two-stage sampling-based approximation algorithm that constructs a coreset and thus achieves a (1 + ε)-approximation for the ℓ_p regression problem. The first stage of the algorithm is sufficient to obtain a (fixed) constant factor approximation. The second stage of the algorithm carefully uses the output of the first stage to construct a coreset and achieve an arbitrary constant factor approximation.

1. For the special case of p = 2, vector space methods can solve the regression problem in time O(m^2 n), and if p = 1 linear programming methods can be used.
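A highly simplified sketch of the two-stage structure just described, shown for p = 2 so the sampled subproblems can be solved by least squares; the probability computations are abstracted behind assumed helpers (stage1_probs, stage2_probs), and this is an illustration under those assumptions, not the algorithm's actual specification.

```python
import numpy as np

def two_stage_lp_regression(A, b, stage1_probs, stage2_probs, p=2, rng=None):
    """Skeleton of the two-stage sampling approach (p = 2 shown for simplicity).

    stage1_probs(A)            -> per-row sampling probabilities for the first stage
    stage2_probs(A, b, x_hat)  -> refined probabilities using the first-stage residual
    """
    rng = np.random.default_rng() if rng is None else rng

    def solve_sample(probs):
        probs = np.asarray(probs, dtype=float)
        keep = rng.random(A.shape[0]) < probs
        w = 1.0 / probs[keep] ** (1.0 / p)          # rescale kept rows
        x, *_ = np.linalg.lstsq(A[keep] * w[:, None], b[keep] * w, rcond=None)
        return x

    x1 = solve_sample(stage1_probs(A))              # constant-factor approximation
    x2 = solve_sample(stage2_probs(A, b, x1))       # refined (1 + eps)-style approximation
    return x2
```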


4.1.1 Related work

Clarkson studied sampling algorithms for the `1 regression problem [15]. He preprocessed the input matrix A by using the ellipsoid method to make A well-behaved with respect to the `1 norm, and he also applied a subgradient-descent-based approximation algorithm to guarantee that the `1 norm of the vector b was conveniently bounded. He proved that coresets exist and can be computed efficiently—for this modified `1 regression problem [15]. For the `2 case, Drineas, Mahoney and Muthukrishnan [34] designed sampling strategies to preserve the subspace information of A and proved the existence of a coreset of rows of size O(d2 /2 )—for the original `2 regression problem, where d is the rank of A; this leads to a (1 + )approximation algorithm. While their algorithm used O(nd2 ) time to construct the coreset and solve the `2 regression problem—which is sufficient time to solve the regression problem—in a subsequent work, Sarl´os [76] improved the running time ˜ for solving the regression problem to O(nd) by using random sketches based on the Fast Johnson–Lindenstrauss transform of Ailon and Chazelle [4]. f (d)

More generally, embedding d-dimensional subspaces of Lp into `p

using coordi-

nate restrictions has been extensively studied [9,77,78,81,82]. Using well-conditioned bases, one can provide a constructive analog of Schechtman’s existential L1 embedding result [77] (see also [9]), that any d-dimensional subspace of L1 (0, 1) can be embedded in `r1 with distortion (1 + ) with r = O(d2 /2 ), albeit with an extra √ factor of d in the sampling complexity. Coresets have been analyzed by the computation geometry community as a tool for efficiently approximating various extent measures [2, 3]; see also [7, 36, 54] for applications of coresets in combinatorial optimization. An important difference is that most of the coreset constructions are exponential in the dimension, and thus applicable only to low-dimensional problems, whereas our coresets are polynomial in the dimension, and thus applicable to high-dimensional problems. Sampling techniques for matrix approximation have also been studied; see, e.g., [30–32, 37, 74] and references therein. 83

4.2

Our contributions

Summary of results. For simplicity of presentation, we summarize the results for the case of m = d = rank(A). Let k = max{p/2 + 1, p} and let φ(r, d) be the time required to solve an `p regression problem with r constraints and d variables. In the first stage of the algorithm, we compute a set of sampling probabilities p1 , . . . , pn in time O(nd5 log n), sample rb1 = O(36p dk+1 ) rows of A and the corresponding elements of b according to the pi ’s, and solve an `p regression problem on the (much smaller) sample; we prove this is an 8-approximation algorithm with a running time of O (nd5 log n + φ(rb1 , d)). In the second stage of the algorithm, we use the residual from the first stage to compute a new set of sampling probabilities q1 , . . . , qn , sample additional rb2 = O(rb1 /2 ) rows of A and the corresponding elements of b according to the qi ’s, and solve an `p regression problem on the (much smaller) sample; we prove this is a (1 + )-approximation algorithm with a total running time of O (nd5 log n + φ(rb2 , d)) (Section 4.5). We also show how to extend our basic algorithm to commonly encountered and more general settings of constrained, generalized, and weighted `p regression problems (Section 4.6). Our algorithm unifies, improves upon, and extends the existing algorithms for special cases of `p regression, namely p = 1, 2, which we mentioned earlier. For example, for p = 1, it improves the sampling complexity of the result of Clarkson [15] from O(d3.5 log d/2 ) to O(d2.5 /2 ); and, for p = 2, it matches the sampling complexity of the result of Drineas, Mahoney, and Muthukrishnan [34], which is O(d2 /2 ). Our result also implicitly establishes the existence of an efficiently computable coreset for the overconstrained `p regression problem. Technical highlights. In course of proving our result, we develop two technical concepts and tools that are of independent interest (Section 4.4). (1) Well-conditioned bases. Informally speaking, if U is a well-conditioned basis, then for all z ∈ Rd , kzkp should be close to kU zkp . We will formalize this by requiring that 84

for all z ∈ Rd , kzkq multiplicatively approximates kU zkp by a factor that can depend on d but is independent of n (where p and q are conjugate; i.e., q = p/(p − 1)). We show that these bases exist and can be constructed in time O(nd5 log n). In fact, our notion of a well-conditioned basis can be interpreted as a computational analog of the Auerbach and Lewis bases studied in functional analysis [89]. They are also related to the barycentric spanners recently introduced by Awerbuch and R. Kleinberg [6] (Section 4.4.1). J. Kleinberg and Sandler [60] defined the notion of an `1 -independent basis, and our well-conditioned basis can be used to obtain an exponentially better “condition number” than their construction. Further, Clarkson [15] defined the notion of an “`1 -conditioned matrix,” and he preprocessed the input matrix to an `1 regression problem so that it satisfies conditions similar to those satisfied by our bases. (2) Subspace-preserving sampling. We show that sampling rows of A according to information in the rows of a well-conditioned basis of A minimizes the sampling variance and consequently, the rank of A is not lost by sampling. This is critical for our relative-error approximation guarantees. The notion of subspace-preserving sampling was used in [34] for p = 2, but we abstract and generalize this concept for all p ∈ [1, ∞).

Overview of our methods. Given an input matrix A, we first construct a wellconditioned basis for A and use that to obtain bounds on a slightly non-standard notion of a p-norm condition number of a matrix. The use of this particular condition number is crucial since the variance in the subspace preserving sampling can be upper bounded in terms of it. An ε-net argument then shows that the first stage sampling gives us a 8-approximation. The next twist is to use the output of the first stage as a feedback to fine-tune the sampling probabilities. This is done so that the “positional information” of b with respect to A is also preserved in addition to the subspace. A more careful use of a different ε-net shows that the second stage sampling achieves 85

a (1 + )-approximation.

4.3

Preliminaries

Given a vector x ∈ Rm , its p-norm is kxkp =

Pm

p 1/p , i=1 (|xi | )

and the dual norm of

k·kp is denoted k·kq , where 1/p + 1/q = 1. Given a matrix A ∈ Rn×m , its generalized P P p 1/p p-norm is |||A|||p = ( ni=1 m . This is a submultiplicative matrix norm j=1 |Aij | ) that generalizes the Frobenius norm from p = 2 to all p ∈ [1, ∞), but it is not a vector-induced matrix norm. The j-th column of A is denoted A?j , and the i-th P P row is denoted Ai? . In this notation, |||A|||p = ( j kA?j kpp )1/p = ( i kAi? kpp )1/p . For x, x0 , x00 ∈ Rm , it can be shown using H¨older’s inequality that kx − x0 kpp ≤   2p−1 kx − x00 kpp + kx00 − x0 kpp . Two crucial ingredients in our proofs are ε-nets and tail-inequalities. A subset N (D) of a set D is called an ε-net in D for some ε > 0 if for every x ∈ D, there is a y ∈ N (D) with kx − yk ≤ ε. In order to construct an ε-net for D it is enough to choose N (D) to be the maximal set of points that are pairwise ε apart. It is well known that the unit ball of a d-dimensional space has an ε-net of size at most (3/ε)d [9]. With respect to tail-inequalities, we will use the following version of Bernstein’s inequality. Theorem 4.1 ( [8,68]). Let {Xi }ni=1 be independent random variables with E[Xi2 ] < P ∞ and Xi ≥ 0. Set Y = i Xi and let γ > 0. Then   −γ 2 P . (4.2) Pr [Y ≤ E[Y ] − γ] ≤ exp 2 i E[Xi2 ] If Xi − E[Xi ] ≤ ∆ for all i, then with σi2 = E[Xi2 ] − E[Xi ]2 we have   −γ 2 P Pr [Y ≥ E[Y ] + γ] ≤ exp . 2 i σi2 + 2γ∆/3

(4.3)

Finally, throughout this chapter, we will use the following sampling matrix formalism to represent our sampling operations. Given a set of n probabilities, pi ∈ 86

(0, 1], for i = 1, . . . , n, let S be an n × n diagonal sampling matrix such that Sii is 1/p

set to 1/pi

with probability pi and to zero otherwise. Clearly, premultiplying A

or b by S determines whether the i-th row of A and the corresponding element of b will be included in the sample, and the expected number of rows/elements selected is P r0 = ni=1 pi . (In what follows, we will abuse notation slightly by ignoring zeroed out rows and regarding S as an r0 × n matrix and thus SA as an r0 × m matrix.) Thus, e.g., sampling constraints from Equation (4.1) and solving the induced subproblem may be represented as solving Zˆ = minm kSAˆ x − Sbkp . x ˆ∈R

(4.4)

A vector xˆ is said to be a κ-approximation to the `p regression problem of Equax − bkp ≤ κZ. Finally, the Appendix contains all the tion (4.1), for κ ≥ 1, if kAˆ missing proofs.

4.4 4.4.1

Main technical ingredients Well-conditioned bases

We introduce the following notion of a “well-conditioned” basis. Definition 4.2 (Well-conditioned basis). Let A be an n × m matrix of rank d, let p ∈ [1, ∞), and let q be its dual norm. Then an n × d matrix U is an (α, β, p)-wellconditioned basis for the column space of A if (1) |||U |||p ≤ α, and (2) for all z ∈ Rd , kzkq ≤ β kU zkp . We will say that U is a p-well-conditioned basis for the column space of A if α and β are dO(1) , independent of m and n. √ Recall that any orthonormal basis U for span(A) satisfies both |||U |||2 = kU kF = d √ and also kzk2 = kU zk2 for all z ∈ Rd , and thus is a ( d, 1, 2)-well-conditioned basis. Thus, Definition 4.2 generalizes to an arbitrary p-norm, for p ∈ [1, ∞), the notion that an orthogonal matrix is well-conditioned with respect to the 2-norm. Note also 87

that duality is incorporated into Definition 4.2 since it relates the p-norm of the vector z ∈ Rd to the q-norm of the vector U z ∈ Rn , where p and q are dual.2 The existence and efficient construction of these bases is given by the following. Theorem 4.2. Let A be an n × m matrix of rank d, let p ∈ [1, ∞), and let q be its dual norm. Then there exists an (α, β, p)-well-conditioned basis U for the column 1

1

1

space of A such that: if p < 2, then α = d p + 2 and β = 1, if p = 2, then α = d 2 and 1

1

1

1

β = 1; and if p > 2, then α = d^{1/p + 1/2} and β = d^{1/q − 1/2}. Moreover, U can be computed in O(nmd + nd^5 log n) time (or in just O(nmd) time if p = 2).

Proof. Let A = QR, where Q is any n × d matrix that is an orthonormal basis for span(A) and R is a d × m matrix. If p = 2, then Q is the desired basis U; from the discussion following Definition 4.2, α = √d and β = 1, and computing it requires O(nmd) time. Otherwise, fix Q and p and define the norm ‖z‖_{Q,p} := ‖Qz‖_p. A quick check shows that ‖·‖_{Q,p} is indeed a norm:

- ‖z‖_{Q,p} = 0 if and only if z = 0, since Q has full column rank;
- ‖γz‖_{Q,p} = ‖γQz‖_p = |γ| ‖Qz‖_p = |γ| ‖z‖_{Q,p}; and
- ‖z + z′‖_{Q,p} = ‖Q(z + z′)‖_p ≤ ‖Qz‖_p + ‖Qz′‖_p = ‖z‖_{Q,p} + ‖z′‖_{Q,p}.

Consider the set C = {z ∈ R^d : ‖z‖_{Q,p} ≤ 1}, which is the unit ball of the norm ‖·‖_{Q,p}. In addition, define the d × d matrix F such that E_lj = {z ∈ R^d : z^T F z ≤ 1} is the Löwner–John ellipsoid of C. Since C is symmetric about the origin, (1/√d) E_lj ⊆ C ⊆ E_lj; thus, for all z ∈ R^d,

    ‖z‖_lj ≤ ‖z‖_{Q,p} ≤ √d ‖z‖_lj ,                                                     (4.5)

where ‖z‖_lj^2 = z^T F z (see, e.g., [10, pp. 413–414]). Since the matrix F is symmetric positive definite, we can express it as F = G^T G, where G is full rank and upper triangular. Since Q is an orthogonal basis for span(A) and G is a d × d matrix of full rank, it follows that U = QG^{-1} is an n × d matrix that spans the column space of A. We claim that U := QG^{-1} is the desired p-well-conditioned basis.(2)

Footnote 2: For p = 2, Drineas, Mahoney, and Muthukrishnan used this basis, i.e., an orthonormal matrix, to construct probabilities to sample the original matrix. For p = 1, Clarkson used a procedure similar to the one we describe in the proof of Theorem 4.2 to preprocess A such that the 1-norm of z is a d√d factor away from the 1-norm of Az.

To establish this claim, let z′ = Gz. Thus, ‖z‖_lj^2 = z^T F z = z^T G^T G z = (Gz)^T (Gz) = ‖z′‖_2^2. Furthermore, since G is invertible, z = G^{-1} z′, and thus ‖z‖_{Q,p} = ‖Qz‖_p = ‖QG^{-1} z′‖_p. By combining these expressions with (4.5), it follows that for all z′ ∈ R^d,

    ‖z′‖_2 ≤ ‖U z′‖_p ≤ √d ‖z′‖_2 .                                                       (4.6)

Since |||U|||_p^p = Σ_j ‖U_{*j}‖_p^p = Σ_j ‖U e_j‖_p^p ≤ Σ_j d^{p/2} ‖e_j‖_2^p = d^{p/2 + 1}, where the inequality follows from the upper bound in (4.6), it follows that α = d^{1/p + 1/2}. If p < 2, then q > 2 and ‖z‖_q ≤ ‖z‖_2 for all z ∈ R^d; by combining this with (4.6), it follows that β = 1. On the other hand, if p > 2, then q < 2 and ‖z‖_q ≤ d^{1/q − 1/2} ‖z‖_2; by combining this with (4.6), it follows that β = d^{1/q − 1/2}.

In order to construct U, we need to compute Q and G and then invert G. Our matrix A can be decomposed into QR using the compact QR decomposition in O(nmd) time. The matrix F describing the Löwner–John ellipsoid of the unit ball of ‖·‖_{Q,p} can be computed in O(nd^5 log n) time. Finally, computing G from F takes O(d^3) time, and inverting G takes O(d^3) time.

Connection to barycentric spanners. A point set K = {K_1, ..., K_d} ⊆ D ⊆ R^d is a barycentric spanner for the set D if every z ∈ D may be expressed as a linear combination of elements of K using coefficients in [−C, C], for C = 1. When C > 1, K is called a C-approximate barycentric spanner. Barycentric spanners were introduced by Awerbuch and R. Kleinberg in [6]. They showed that if a set is compact, then it has a barycentric spanner. Our proof shows that if A is an n × d matrix, then τ^{-1} = R^{-1} G^{-1} ∈ R^{d×d} is a √d-approximate barycentric spanner for D = {z ∈ R^d : ‖Az‖_p ≤ 1}. To see this, first note that each τ^{-1}_{*j} belongs to D since ‖A τ^{-1}_{*j}‖_p = ‖U e_j‖_p ≤ ‖e_j‖_2 = 1, where the inequality is obtained from Equation (4.6). Moreover, since τ^{-1} spans R^d, we can write any z ∈ D as z = τ^{-1} ν. Hence,

    ‖ν‖_∞ / √d ≤ ‖ν‖_2 / √d ≤ ‖U ν‖_p = ‖A τ^{-1} ν‖_p = ‖Az‖_p ≤ 1 ,

where the second inequality is also obtained from Equation (4.6). This shows that our basis has the added property that every element z ∈ D can be expressed as a linear combination of elements (or columns) of τ^{-1} using coefficients whose ℓ2 norm is bounded by √d.

Connection to Auerbach bases. An Auerbach basis U = {U_{*j}}_{j=1}^d for a d-dimensional normed space A is a basis such that ‖U_{*j}‖_p = 1 for all j and such that whenever y = Σ_j ν_j U_{*j} is in the unit ball of A then |ν_j| ≤ 1 for all j. The existence of such a basis for every finite-dimensional normed space was first proved by Herman Auerbach [5] (see also [24, 83]). It can easily be shown that an Auerbach basis is an (α, β, p)-well-conditioned basis, with α = d^{1/p} and β = d^{1/q} for all p. Further, suppose U is an Auerbach basis for span(A), where A is an n × d matrix of rank d. Writing A = Uτ, it follows that τ^{-1} is an exact barycentric spanner for D = {z ∈ R^d : ‖Az‖_p ≤ 1}. Specifically, each τ^{-1}_{*j} ∈ D since ‖A τ^{-1}_{*j}‖_p = ‖U_{*j}‖_p = 1. Now write z ∈ D as z = τ^{-1} ν. Since the vector y = Az = Uν is in the unit ball of span(A), we have |ν_j| ≤ 1 for all 1 ≤ j ≤ d. Therefore, computing a barycentric spanner for the compact set D (which is the pre-image of the unit ball of span(A)) is equivalent, up to polynomial factors, to computing an Auerbach basis for span(A).
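To make Definition 4.2 and the bounds above concrete, the following small numerical sketch (an illustration only, assuming numpy is available) builds the orthonormal QR factor of a random matrix A, which the proof of Theorem 4.2 shows is exactly the p-well-conditioned basis when p = 2, and estimates its conditioning numbers: α as |||U|||_p and β as the largest observed ratio ‖z‖_q / ‖Uz‖_p over random test vectors. The general construction for p ≠ 2 additionally requires the Löwner–John ellipsoid computation, which is omitted here; the function and variable names are ours, not the dissertation's.

    import numpy as np

    def conditioning_estimate(U, p, trials=2000, seed=0):
        """Estimate (alpha, beta) of Definition 4.2 for a candidate basis U.

        alpha = |||U|||_p (entry-wise p-norm); beta is lower-bounded empirically
        by the largest ||z||_q / ||Uz||_p seen over random directions z.
        """
        rng = np.random.default_rng(seed)
        d = U.shape[1]
        q = p / (p - 1) if p > 1 else np.inf        # dual index, 1/p + 1/q = 1
        alpha = np.sum(np.abs(U) ** p) ** (1.0 / p)
        beta = 0.0
        for _ in range(trials):
            z = rng.standard_normal(d)
            beta = max(beta, np.linalg.norm(z, ord=q) / np.linalg.norm(U @ z, ord=p))
        return alpha, beta

    if __name__ == "__main__":
        rng = np.random.default_rng(1)
        n, d, p = 1000, 5, 2
        A = rng.standard_normal((n, d))
        Q, _ = np.linalg.qr(A, mode="reduced")      # orthonormal basis for span(A)
        alpha, beta = conditioning_estimate(Q, p)
        # For p = 2 the proof of Theorem 4.2 gives alpha = sqrt(d) and beta = 1.
        print(alpha, np.sqrt(d), beta)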

4.4.2  Subspace-preserving sampling

In the previous subsection (and in the notation of the proof of Theorem 4.2), we saw that given p ∈ [1, ∞), any n × m matrix A of rank d can be decomposed as

    A = QR = QG^{-1} GR = U τ ,

where U = QG^{-1} is a p-well-conditioned basis for span(A) and τ = GR. The significance of a p-well-conditioned basis is that we are able to minimize the variance in our sampling process by randomly sampling rows of the matrix A and elements of the vector b according to a probability distribution that depends on norms of the rows of the matrix U. This will allow us to preserve the subspace structure of span(A) and thus to achieve relative-error approximation guarantees.

More precisely, given p ∈ [1, ∞) and any n × m matrix A of rank d decomposed as A = U τ, where U is an (α, β, p)-well-conditioned basis for span(A), consider any set of sampling probabilities p_i, for i = 1, ..., n, that satisfy

    p_i ≥ min{ 1 , r · ‖U_{i*}‖_p^p / |||U|||_p^p } ,                                     (4.7)

where r = r(α, β, p, d, ε) is to be determined below. Let us randomly sample the ith row of A with probability p_i, for all i = 1, ..., n. Recall that we can construct a diagonal sampling matrix S, where each S_ii = 1/p_i^{1/p} with probability p_i and 0 otherwise, in which case we can represent the sampling operation as SA.
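As an illustration of this sampling scheme (a sketch only, assuming numpy; the oversampling parameter r and the basis U are taken as given), the following code forms the probabilities of Equation (4.7) with equality, realizes the diagonal matrix S implicitly by keeping and rescaling the selected rows, and empirically checks the relative distortion |‖SAx‖_p − ‖Ax‖_p| / ‖Ax‖_p on random vectors x. The orthonormal QR factor is used as a stand-in for the p-well-conditioned basis, which Theorem 4.2 guarantees only for p = 2.

    import numpy as np

    def sample_rows(A, U, p, r, rng):
        """Row sampling of Eq. (4.7): keep row i w.p. p_i, rescaled by 1/p_i^{1/p}."""
        row_norms = np.sum(np.abs(U) ** p, axis=1)          # ||U_{i*}||_p^p
        probs = np.minimum(1.0, r * row_norms / row_norms.sum())
        keep = rng.random(A.shape[0]) < probs
        scale = 1.0 / probs[keep] ** (1.0 / p)              # nonzero diagonal of S
        return np.flatnonzero(keep), scale

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n, d, p, r = 5000, 4, 1, 400
        A = rng.standard_normal((n, d))
        Q, _ = np.linalg.qr(A, mode="reduced")              # stand-in basis (exact for p = 2)
        idx, scale = sample_rows(A, Q, p, r, rng)
        SA = scale[:, None] * A[idx]                        # the sampled-and-rescaled rows S A
        for _ in range(5):
            x = rng.standard_normal(d)
            full, sub = np.linalg.norm(A @ x, p), np.linalg.norm(SA @ x, p)
            print(abs(sub - full) / full)                   # small with high probability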

The following theorem is our main result regarding this subspace-preserving sampling procedure.

Theorem 4.3. Let A be an n × m matrix of rank d, and let p ∈ [1, ∞). Let U be an (α, β, p)-well-conditioned basis for span(A), and let us randomly sample rows of A according to the procedure described above using the probability distribution given by Equation (4.7), where

    r ≥ 32^p (αβ)^p ( d ln(12/ε) + ln(2/δ) ) / (p^2 ε^2) .

Then, with probability 1 − δ, the following holds for all x ∈ R^m:

    | ‖SAx‖_p − ‖Ax‖_p | ≤ ε ‖Ax‖_p .

Proof. For simplicity of presentation, in this and other proofs we will generally drop the subscript from our matrix and vector p-norms; i.e., unsubscripted norms will be p-norms. Note that it suffices to prove that, for all x ∈ R^m,

    (1 − ε)^p ‖Ax‖^p ≤ ‖SAx‖^p ≤ (1 + ε)^p ‖Ax‖^p ,                                       (4.8)

with probability 1 − δ. To this end, fix a vector x ∈ R^m, define the random variable X_i = (S_ii |A_{i*} x|)^p, and recall that A_{i*} = U_{i*} τ since A = Uτ. Clearly, Σ_{i=1}^n X_i = ‖SAx‖^p. In addition, since E[X_i] = |A_{i*} x|^p, it follows that Σ_{i=1}^n E[X_i] = ‖Ax‖^p. To bound Equation (4.8), first note that

    Σ_{i=1}^n (X_i − E[X_i]) = Σ_{i: p_i < 1} (X_i − E[X_i]) .                             (4.9)

Equation (4.9) follows since, according to the definition of p_i in Equation (4.7), p_i may equal 1 for some rows, and since these rows are always included in the random sample, X_i = E[X_i] for these rows. To bound the right-hand side of Equation (4.9), note that for all i such that p_i < 1,

    |A_{i*} x|^p / p_i ≤ ‖U_{i*}‖_p^p ‖τx‖_q^p / p_i      (by Hölder's inequality)
                       ≤ |||U|||_p^p ‖τx‖_q^p / r          (by Equation (4.7))
                       ≤ (αβ)^p ‖Ax‖^p / r                 (by Definition 4.2 and Theorem 4.2) .          (4.10)

From Equation (4.10), it follows that for each i such that p_i < 1,

    X_i − E[X_i] ≤ X_i ≤ |A_{i*} x|^p / p_i ≤ (αβ)^p ‖Ax‖^p / r ;

thus, we may define ∆ = (αβ)^p ‖Ax‖^p / r. In addition, it also follows from Equation (4.10) that

    Σ_{i: p_i < 1} E[X_i^2] = Σ_{i: p_i < 1} ( |A_{i*} x|^p / p_i ) |A_{i*} x|^p
                            ≤ ( (αβ)^p ‖Ax‖^p / r ) Σ_{i: p_i < 1} |A_{i*} x|^p            (by Equation (4.10))
                            ≤ (αβ)^p ‖Ax‖^{2p} / r ,

from which it follows that Σ_{i: p_i < 1} σ_i^2 ≤ Σ_{i: p_i < 1} E[X_i^2] ≤ (αβ)^p ‖Ax‖^{2p} / r.

To apply the upper tail bound in Theorem 4.1, define γ = ((1 + ε/4)^p − 1) ‖Ax‖^p. It follows that γ^2 ≥ (pε/4)^2 ‖Ax‖^{2p} and also that

    2 Σ_{i: p_i < 1} σ_i^2 + 2γ∆/3 ≤ 2 (αβ)^p ‖Ax‖^{2p} / r + 2 ((1 + ε/4)^p − 1) (αβ)^p ‖Ax‖^{2p} / (3r)
                                   ≤ 32^p (αβ)^p ‖Ax‖^{2p} / r ,

where the second inequality follows by standard manipulations since ε ≤ 1 and since p ≥ 1. Thus, by Equation (4.3) of Theorem 4.1, it follows that

    Pr[ ‖SAx‖^p > ‖Ax‖^p + γ ] = Pr[ Σ_{i: p_i < 1} X_i > E[ Σ_{i: p_i < 1} X_i ] + γ ]
                               ≤ exp( −γ^2 / ( 2 Σ_{i: p_i < 1} σ_i^2 + 2γ∆/3 ) )
                               ≤ exp( −ε^2 p^2 r / ( (αβ)^p 32^p ) ) .

Similarly, to apply the lower tail bound of Equation (4.2) of Theorem 4.1, define γ = (1 − (1 − ε/4)^p) ‖Ax‖^p. Since γ ≥ ε ‖Ax‖^p / 4, we can follow a similar line of reasoning to show that

    Pr[ ‖SAx‖^p < ‖Ax‖^p − γ ] ≤ exp( −γ^2 / ( 2 Σ_{i: p_i < 1} σ_i^2 ) ) ≤ exp( −ε^2 r / ( (αβ)^p · 32 ) ) .

Choosing r ≥ 32^p (αβ)^p (d ln(12/ε) + ln(2/δ)) / (p^2 ε^2), we get that for every fixed x, the following is true with probability at least 1 − (ε/12)^d δ:

    (1 − ε/4)^p ‖Ax‖^p ≤ ‖SAx‖^p ≤ (1 + ε/4)^p ‖Ax‖^p .

Now, consider the ball B = {y ∈ R^n : y = Ax, ‖y‖ ≤ 1} and consider an (ε/4)-net for B. The number of points in the net is (12/ε)^d. Thus, by the union bound, with probability 1 − δ, Equation (4.8) holds for all points in the net. Now, to show that with the same probability Equation (4.8) holds for all points y ∈ B, let y* ∈ B be such that |‖Sy‖ − ‖y‖| is maximized, and let η = sup{ |‖Sy‖ − ‖y‖| : y ∈ B }. Also, let y*_ε ∈ B be the point in the net that is closest to y*. By the triangle inequality,

    η = |‖Sy*‖ − ‖y*‖|
      = |‖S y*_ε + S(y* − y*_ε)‖ − ‖y*_ε + (y* − y*_ε)‖|
      ≤ |‖S y*_ε‖ + ‖S(y* − y*_ε)‖ − ‖y*_ε‖ + 2 ‖y* − y*_ε‖ − ‖y* − y*_ε‖|
      ≤ |‖S y*_ε‖ − ‖y*_ε‖| + |‖S(y* − y*_ε)‖ − ‖y* − y*_ε‖| + 2 ‖y* − y*_ε‖
      ≤ (ε/4) ‖y*_ε‖ + η ε/4 + ε/2 ,

where the last inequality follows since ‖y* − y*_ε‖ ≤ ε/4, the rescaled point (4/ε)(y* − y*_ε) belongs to B, and hence |‖S(y* − y*_ε)‖ − ‖y* − y*_ε‖| ≤ (ε/4) η. Therefore, η ≤ ε since ‖y*_ε‖ ≤ 1 and since we assume ε ≤ 1/7. Thus, Equation (4.8) holds for all points y ∈ B, with probability at least 1 − δ. Similarly, it holds for any y ∈ R^n such that y = Ax, since y/‖y‖ ∈ B and since |‖S(y/‖y‖)‖ − ‖y/‖y‖‖| ≤ ε implies that |‖Sy‖ − ‖y‖| ≤ ε ‖y‖, which completes the proof of the theorem.

Several things should be noted about this result. First, it implies that rank(SA) = rank(A), since otherwise we could choose a vector x ∈ null(SA) and violate the theorem. In this sense, this theorem generalizes the subspace-preservation result of Lemma 4.1 of [34] to all p ∈ [1, ∞). Second, regarding sampling complexity: if p < 2 the sampling complexity is O(d^{p/2 + 2}), if p = 2 it is O(d^2), and if p > 2 it is O(d (d^{1/p + 1/2} d^{1/q − 1/2})^p) = O(d^{p+1}). Finally, note that this theorem is analogous to the main result of Schechtman [77], which uses the notion of Auerbach bases.

Related subspace embedding results. As mentioned in the introduction, embedding d-dimensional subspaces of Lp(0, 1) into ℓ^r_p using coordinate restrictions has been extensively studied [9, 77, 78, 81, 82]. The result of Schechtman [77] (see also [9]) uses Auerbach bases [5] to show that for every d-dimensional subspace A of Lp(0, 1) and every 0 < ε < 1/2, there exists a subspace B of ℓ^r_p such that d(A, B) ≤ 1 + ε if r ≥ c d^2 log(ε^{-1}) / ε^2 for a universal constant c. Here d(A, B) is the Banach–Mazur distance between the two finite-dimensional spaces A and B,

    d(A, B) = inf{ ‖T‖ ‖T^{-1}‖ : T is a linear map from A to B } .

Bourgain, Lindenstrauss and Milman [9] subsequently improved this result to r ≥ c d (log d)^2 log(d/ε) / ε^2 using the notion of Lewis bases [62]. Lewis' theorem generalizes that of F. John [58] on ellipsoids of minimum volume. Bourgain et al. [9] also showed that when 1 < p < 2, r ≥ c(p, ε) d (log d)^3, where c(p, ε) is a factor that depends on p and ε; and when 2 < p < ∞, r ≥ c(p, ε) d^{p/2} log d. Talagrand [81, 82] further improved these results for the cases 1 ≤ p < 2: to c d log d / ε^2 when p = 1, and to c(p, ε) d log d (log log d)^2 when 1 < p < 2. The best known results are summarized in Table 4.1. Surprisingly, when 0 < p < 1, Schechtman and Zvavitch [78] show that A can be embedded in c(p, ε) d (log d)^3 dimensions.

    p         r                                Reference
    1         c d log d / ε^2                  Talagrand [81]
    (1, 2)    c(p, ε) d log d (log log d)^2    Talagrand [82]
    (2, ∞)    c(p, ε) d^{p/2} log d            Bourgain et al. [9]
    (0, 1)    c(p, ε) d (log d)^3              Schechtman and Zvavitch [78]

Table 4.1: Related existential results on embedding subspaces of Lp into ℓ^r_p. The factor c is a universal constant, and c(p, ε) is a factor dependent on p and ε.

Auerbach bases for d-dimensional normed spaces were defined in Section 4.4.1. If A is an n × d matrix of rank d, then an Auerbach basis for span(A) is a basis H that maximizes det(A^T H) subject to ‖H_{*j}‖_p ≤ 1 for all j = 1, ..., d (see, e.g., [89, II.E.11]). A Lewis basis, on the other hand, instead satisfies the constraint

    ( Σ_{i=1}^n ( Σ_{j=1}^d |H_{ij}|^2 )^{p/2} )^{1/p} ≤ d^{1/p}

(see, e.g., [89, III.B.7]). This constraint bounds the p-norm of the vector whose components correspond to the 2-norms of the rows of H. As an aside, the following simple proposition shows that when p = 2, an orthonormal basis U for span(A) is also a Lewis basis. Since ‖U_{*j}‖_2 = 1 for all j, a consequence of the proposition is that U is also an Auerbach basis.

Proposition. Let A be an n × d matrix of rank d. When p = 2, an orthonormal basis for span(A) is also a Lewis basis.

Proof. When p = 2, the above constraint is equivalent to ( Σ_{i=1}^n Σ_{j=1}^d |H_{ij}|^2 )^{1/2} = ‖H‖_F ≤ √d. Now write A = U_A Σ_A V_A^T and H = U_H Σ_H V_H^T. Since rank(A) = rank(H) = d, both V_A and V_H constitute basis vectors for R^d; hence, det(V_A) = det(V_H) = 1. Further, since det(Σ_A) is a constant, it follows that we need to find H that maximizes det(U_A^T U_H) det(Σ_H) subject to ‖H‖_F ≤ √d.

The constraint ‖H‖_F ≤ √d and the AM–GM inequality give

    det(Σ_H) = Π_{j=1}^d σ_j(H) ≤ ( Σ_{j=1}^d σ_j(H)^2 / d )^{d/2} ≤ 1 ,

where σ_j(H), j = 1, ..., d, are the singular values of H. This bound implies that we cannot do better than setting Σ_H = I_d. Finally, since both U_A and U_H are unitary, det(U_A^T U_H) ≤ 1, where equality is achieved when U_H = U_A. Therefore, H = U_A achieves the maximum objective value.

When p ≠ 2, we do not know how to compute Auerbach or Lewis bases.
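The p = 2 statement above is easy to check numerically. The following tiny sketch (an illustration only, assuming numpy) verifies that an orthonormal basis for span(A) has unit-norm columns, the Auerbach property, and Frobenius norm exactly √d, the p = 2 Lewis constraint.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 500, 6
    A = rng.standard_normal((n, d))
    U, _ = np.linalg.qr(A, mode="reduced")
    col_norms = np.linalg.norm(U, axis=0)      # all equal to 1 (Auerbach property)
    frob = np.linalg.norm(U, "fro")            # equals sqrt(d) (Lewis constraint at p = 2)
    print(np.allclose(col_norms, 1.0), np.isclose(frob, np.sqrt(d)))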

4.5  The sampling algorithm

4.5.1  Statement of our main algorithm and theorem

Our main sampling algorithm for approximating the solution to the ℓp regression problem is presented in Figure 4.1. The algorithm takes as input an n × m matrix A of rank d, a vector b ∈ R^n, and a number p ∈ [1, ∞). It is a two-stage algorithm that returns as output a vector x̂_opt ∈ R^m (or a vector x̂_c ∈ R^m if only the first stage is run). In either case, the output is the solution to the induced ℓp regression subproblem constructed on the randomly sampled constraints.

The algorithm first computes a p-well-conditioned basis U for span(A), as described in the proof of Theorem 4.2. Then, in the first stage, the algorithm uses information from the norms of the rows of U to sample constraints from the input ℓp regression problem. In particular, roughly O(d^{p+1}) rows of A, and the corresponding elements of b, are randomly sampled according to the probability distribution given by

    p_i = min{ 1 , r_1 ‖U_{i*}‖_p^p / |||U|||_p^p } , where r_1 = 8^2 · 36^p d^k ( d ln(8 · 36) + ln(200) ) .          (4.11)

Input: An n × m matrix A of rank d, a vector b ∈ R^n, and p ∈ [1, ∞). Let 0 < ε < 1/7, and define k = max{p/2 + 1, p}.

- Find a p-well-conditioned basis U ∈ R^{n×d} for span(A) (as in the proof of Theorem 4.2).

- Stage 1: Define p_i = min{ 1 , r_1 ‖U_{i*}‖_p^p / |||U|||_p^p }, where r_1 = 8^2 · 36^p d^k (d ln(8 · 36) + ln(200)).
- Generate (implicitly) S where S_ii = 1/p_i^{1/p} with probability p_i and 0 otherwise.
- Let x̂_c be the solution to min_{x ∈ R^m} ‖S(Ax − b)‖_p.

- Stage 2: Let ρ̂ = A x̂_c − b, and unless ρ̂ = 0, define q_i = min{ 1 , max{ p_i , r_2 |ρ̂_i|^p / ‖ρ̂‖_p^p } }, where r_2 = (36^p d^k / ε^2) (d ln(36/ε) + ln(200)).
- Generate (implicitly, a new) T where T_ii = 1/q_i^{1/p} with probability q_i and 0 otherwise.
- Let x̂_opt be the solution to min_{x ∈ R^m} ‖T(Ax − b)‖_p.

Output: x̂_opt (or x̂_c if only the first stage is run).

Figure 4.1: Sampling algorithm for ℓp regression.

These first-stage samples are implicitly represented by a diagonal sampling matrix S, where each S_ii = 1/p_i^{1/p}. For the remainder of the chapter, we will use S to denote the sampling matrix for the first-stage sampling probabilities. The algorithm then solves, using any ℓp solver of one's choice, the smaller subproblem. If the solution to the induced subproblem is denoted x̂_c, then, as we will see in Theorem 4.4, this is an 8-approximation to the original problem.(3)

Footnote 3: For p = 2, Drineas, Mahoney, and Muthukrishnan show that this first stage actually leads to a (1 + ε)-approximation. For p = 1, Clarkson develops a subgradient-based algorithm and runs it, after preprocessing the input, on all the input constraints to obtain a constant-factor approximation in a stage analogous to our first stage. Here, however, we solve an ℓp regression problem on a small subset of the constraints to obtain the constant-factor approximation. Moreover, our procedure works for all p ∈ [1, ∞).

In the second stage, the algorithm uses information from the residual of the 8-approximation computed in the first stage to refine the sampling probabilities. Define the residual ρ̂ = A x̂_c − b (and note that ‖ρ̂‖_p ≤ 8Z). Then, roughly O(d^{p+1}/ε^2) rows of A, and the corresponding elements of b, are randomly sampled according to the probability distribution

    q_i = min{ 1 , max{ p_i , r_2 |ρ̂_i|^p / ‖ρ̂‖_p^p } } , where r_2 = (36^p d^k / ε^2) ( d ln(36/ε) + ln(200) ) .          (4.12)

As before, this can be represented as a diagonal sampling matrix T, where each T_ii = 1/q_i^{1/p} with probability q_i and 0 otherwise. For the remainder of the chapter, we will use T to denote the sampling matrix for the second-stage sampling probabilities. Again, the algorithm solves, using any ℓp solver of one's choice, the smaller subproblem. If the solution to the induced subproblem at the second stage is denoted x̂_opt, then, as we will see in Theorem 4.4, this is a (1 + ε)-approximation to the original problem.(4)

Footnote 4: The subspace-based sampling probabilities (4.11) are similar to those used by Drineas, Mahoney, and Muthukrishnan [34], while the residual-based sampling probabilities (4.12) are similar to those used by Clarkson [15].

The following is our main theorem for the ℓp regression algorithm presented in Figure 4.1.

Theorem 4.4. Let A be an n × m matrix of rank d, let b ∈ R^n, and let p ∈ [1, ∞). Recall that r_1 = 8^2 · 36^p d^k (d ln(8 · 36) + ln(200)) and r_2 = (36^p d^k / ε^2)(d ln(36/ε) + ln(200)). Then,

• Constant-factor approximation. If only the first stage of the algorithm in Figure 4.1 is run, then with probability at least 0.6, the solution x̂_c to the sampled problem based on the p_i's of Equation (4.7) is an 8-approximation to the ℓp regression problem;

• Relative-error approximation. If both stages of the algorithm are run, then with probability at least 0.5, the solution x̂_opt to the sampled problem based on the q_i's of Equation (4.12) is a (1 + ε)-approximation to the ℓp regression problem;

• Running time. The running time of the ith stage of the algorithm is O(nmd + nd^5 log n + φ(20 i r_i, m)), where φ(s, t) is the time taken to solve the regression problem min_{x ∈ R^t} ‖A′x − b′‖_p, where A′ ∈ R^{s×t} is of rank d and b′ ∈ R^s.
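For concreteness, here is a compact Python sketch of the two-stage procedure of Figure 4.1 (an illustrative sketch only, assuming numpy and scipy are available). It treats the orthonormal QR factor as the p-well-conditioned basis, which the proof of Theorem 4.2 guarantees only for p = 2; it takes the sample-size parameters r1 and r2 as user-supplied values rather than computing the constants of Equations (4.11) and (4.12); it assumes a nonzero first-stage residual; and it uses a generic numerical optimizer as a stand-in for "any ℓp solver of one's choice." All function names here are ours.

    import numpy as np
    from scipy.optimize import minimize

    def lp_solve(M, c, p, x0=None):
        """Stand-in l_p solver: minimize ||Mx - c||_p with a generic optimizer."""
        x0 = np.zeros(M.shape[1]) if x0 is None else x0
        return minimize(lambda x: np.linalg.norm(M @ x - c, ord=p), x0,
                        method="Nelder-Mead").x

    def sample(probs, p, rng):
        """Keep index i with probability probs[i]; rescale kept rows by probs[i]^{-1/p}."""
        keep = rng.random(probs.size) < probs
        return keep, 1.0 / probs[keep] ** (1.0 / p)

    def two_stage_lp_regression(A, b, p, r1, r2, rng):
        U, _ = np.linalg.qr(A, mode="reduced")          # basis (well-conditioned only for p = 2)
        u = np.sum(np.abs(U) ** p, axis=1)              # ||U_{i*}||_p^p
        p_i = np.minimum(1.0, r1 * u / u.sum())         # Eq. (4.11), up to constants
        keep, scale = sample(p_i, p, rng)
        x_c = lp_solve(scale[:, None] * A[keep], scale * b[keep], p)      # stage-1 solution

        rho = A @ x_c - b                                # stage-1 residual (assumed nonzero)
        q_i = np.minimum(1.0, np.maximum(
            p_i, r2 * np.abs(rho) ** p / np.sum(np.abs(rho) ** p)))       # Eq. (4.12), up to constants
        keep2, scale2 = sample(q_i, p, rng)
        x_opt = lp_solve(scale2[:, None] * A[keep2], scale2 * b[keep2], p, x0=x_c)
        return x_c, x_opt

    if __name__ == "__main__":
        rng = np.random.default_rng(0)
        n, d, p = 2000, 3, 1
        A = rng.standard_normal((n, d))
        b = A @ np.ones(d) + 0.1 * rng.standard_normal(n)
        x_c, x_opt = two_stage_lp_regression(A, b, p, r1=150, r2=400, rng=rng)
        print(np.linalg.norm(A @ x_opt - b, ord=p))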

Note that since the algorithm of Figure 4.1 constructs the (α, β, p)-well-conditioned basis U using the procedure in the proof of Theorem 4.2, our sampling complexity depends on α and β. In particular, it will be O(d (αβ)^p). Thus, if p < 2 our sampling complexity is O(d · d^{p/2 + 1}) = O(d^{p/2 + 2}); if p > 2 it is O(d (d^{1/p + 1/2} d^{1/q − 1/2})^p) = O(d^{p+1}); and (although not explicitly stated, our proof will make it clear that) if p = 2 it is O(d^2). Note also that we have stated the claims of the theorem as holding with constant probability, but they can be shown to hold with probability at least 1 − δ by using standard amplification techniques.
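As a quick numerical illustration of these sample sizes (a sketch assuming only the Python standard library; the constants come directly from Equations (4.11) and (4.12)):

    import math

    def sample_sizes(d, p, eps):
        """Evaluate r1 and r2 from Equations (4.11) and (4.12)."""
        k = max(p / 2 + 1, p)
        r1 = 8**2 * 36**p * d**k * (d * math.log(8 * 36) + math.log(200))
        r2 = (36**p * d**k / eps**2) * (d * math.log(36 / eps) + math.log(200))
        return r1, r2

    if __name__ == "__main__":
        for p in (1, 2, 4):
            r1, r2 = sample_sizes(d=10, p=p, eps=0.1)
            # r1 grows like d^{k+1}, and r2 like d^{k+1}/eps^2, i.e. O(d^{p+1}/eps^2) for p > 2.
            print(p, round(r1), round(r2))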

4.5.2  The first stage of sampling: a constant-factor approximation

To prove the claims of Theorem 4.4 having to do with the output of the algorithm after the first stage of sampling, we begin with two lemmas. First note that, because of our choice of r_1, we can apply the subspace-preserving Theorem 4.3 with only a constant distortion; i.e., for all x, we have

    (7/8) ‖Ax‖_p ≤ ‖SAx‖_p ≤ (9/8) ‖Ax‖_p

with probability at least 0.99. The first lemma below now states that the optimal solution to the original problem provides a small (constant-factor) residual when evaluated in the sampled problem. (As in the proof of Theorem 4.3, unsubscripted norms will be p-norms.)

Lemma 4.5. ‖S(A x_opt − b)‖ ≤ 3Z, with probability at least 1 − 1/3^p.

Proof. Define X_i = (S_ii |A_{i*} x_opt − b_i|)^p. Thus, Σ_i X_i = ‖S(A x_opt − b)‖^p, and the first moment is E[ Σ_i X_i ] = ‖A x_opt − b‖^p = Z^p. The lemma follows since, by Markov's inequality,

    Pr[ Σ_i X_i > 3^p E[ Σ_i X_i ] ] ≤ 1/3^p ;

i.e., ‖S(A x_opt − b)‖^p > 3^p ‖A x_opt − b‖^p with probability no more than 1/3^p.

The next lemma states that if the solution to the sampled problem provides a constant-factor approximation (when evaluated in the sampled problem), then when this solution is evaluated in the original regression problem we get a (slightly weaker) constant-factor approximation.

Lemma 4.6. If ‖S(A x̂_c − b)‖ ≤ 3Z, then ‖A x̂_c − b‖ ≤ 8Z.

Proof. We will prove the contrapositive: if ‖A x̂_c − b‖ > 8Z, then ‖S(A x̂_c − b)‖ > 3Z. To do so, note that, by Theorem 4.3 and the choice of r_1, we have that

    (7/8) ‖Ax‖_p ≤ ‖SAx‖_p ≤ (9/8) ‖Ax‖_p .

Using this,

    ‖S(A x̂_c − b)‖ ≥ ‖SA(x̂_c − x_opt)‖ − ‖S(A x_opt − b)‖              (by the triangle inequality)
                   ≥ (7/8) ‖A x̂_c − A x_opt‖ − 3Z                         (by Thm. 4.3 and Lemma 4.5)
                   ≥ (7/8) ( ‖A x̂_c − b‖ − ‖A x_opt − b‖ ) − 3Z           (by the triangle inequality)
                   > (7/8) (8Z − Z) − 3Z                                    (by the premise ‖A x̂_c − b‖ > 8Z)
                   > 3Z ,

which establishes the lemma.

Clearly, ‖S(A x̂_c − b)‖ ≤ ‖S(A x_opt − b)‖ (since x̂_c is an optimum for the sampled ℓp regression problem). Combining this with Lemmas 4.5 and 4.6, it follows that the solution x̂_c to the sampled problem based on the p_i's of Equation (4.7) satisfies ‖A x̂_c − b‖ ≤ 8Z; i.e., x̂_c is an 8-approximation to the original error Z.

To conclude the proof of the claims for the first stage of sampling, note that by our choice of r_1, Theorem 4.3 fails to hold for our first-stage sampling with probability no greater than 1/100. In addition, Lemma 4.5 fails to hold with probability no greater than 1/3^p, which is no greater than 1/3 for all p ∈ [1, ∞). Finally, let r̂_1 be a random variable representing the number of rows actually chosen by our sampling schema, and note that E[r̂_1] ≤ r_1. By Markov's inequality, it follows that r̂_1 > 20 r_1 with probability less than 1/20. Thus, the first stage of our algorithm fails to give an 8-approximation in the specified running time with a probability bounded by 1/3 + 1/20 + 1/100 < 2/5.

4.5.3  The second stage of sampling: a relative-error approximation

The proof of the claims of Theorem 4.4 having to do with the output of the algorithm after the second stage of sampling will parallel that for the first stage, but it will have several technical complexities that arise since the first triangle inequality approximation in the proof of Lemma 4.6 is too coarse for relative-error approximation.

By our choice of r_2 again, we have a finer result for subspace preservation. Thus, with probability 0.99, the following holds for all x:

    (1 − ε) ‖Ax‖_p ≤ ‖TAx‖_p ≤ (1 + ε) ‖Ax‖_p .

As before, we start with a lemma that states that the optimal solution to the original problem provides a small (now a relative-error) residual when evaluated in the sampled problem. This is the analog of Lemma 4.5. An important difference is that the second-stage sampling probabilities significantly enhance the probability of success.

Lemma 4.7. ‖T(A x_opt − b)‖ ≤ (1 + ε)Z, with probability at least 0.99.

Proof. Define the random variable X_i = (T_ii |A_{i*} x_opt − b_i|)^p, and recall that A_{i*} = U_{i*} τ since A = Uτ. Clearly, Σ_{i=1}^n X_i = ‖T(A x_opt − b)‖^p. In addition, since E[X_i] = |A_{i*} x_opt − b_i|^p, it follows that Σ_{i=1}^n E[X_i] = ‖A x_opt − b‖^p. We will use Equation (4.3) of Theorem 4.1 to bound Σ_i (X_i − E[X_i]) = ‖T(A x_opt − b)‖^p − ‖A x_opt − b‖^p.

From the definition of q_i in Equation (4.12), it follows that for some of the rows, q_i may equal 1 (just as in the proof of Theorem 4.3). Since X_i = E[X_i] for these rows, Σ_i (X_i − E[X_i]) = Σ_{i: q_i < 1} (X_i − E[X_i]), and thus we will bound this latter quantity with Equation (4.3). To do so, we must first provide a bound for X_i − E[X_i] ≤ X_i and for Σ_{i: q_i < 1} σ_i^2 ≤ Σ_i E[X_i^2]. To that end, note that

    |A_{i*}(x_opt − x̂_c)| ≤ ‖U_{i*}‖_p ‖τ(x_opt − x̂_c)‖_q                (by Hölder's inequality)
                          ≤ ‖U_{i*}‖_p β ‖Uτ(x_opt − x̂_c)‖_p             (by Def. 4.2 and Thm. 4.2)
                          ≤ ‖U_{i*}‖_p β ( ‖A x_opt − b‖ + ‖A x̂_c − b‖ )  (by the triangle inequality)
                          ≤ ‖U_{i*}‖_p β · 9Z ,                             (4.13)

where the final inequality follows from the definition of Z and the results from the first stage of sampling. Next, note that from the conditions on the probabilities q_i in Equation (4.12), as well as by Definition 4.2 and the output of the first stage of sampling, it follows that

    |ρ̂_i|^p / q_i ≤ ‖ρ̂‖^p / r_2 ≤ 8^p Z^p / r_2    and    ‖U_{i*}‖_p^p / q_i ≤ |||U|||_p^p / r_2 ≤ α^p / r_2 ,          (4.14)

for all i such that q_i < 1. Thus, since X_i − E[X_i] ≤ X_i ≤ |A_{i*} x_opt − b_i|^p / q_i, it follows that for all i such that q_i < 1,

    X_i − E[X_i] ≤ 2^{p−1} ( |A_{i*}(x_opt − x̂_c)|^p + |ρ̂_i|^p ) / q_i                (since ρ̂ = A x̂_c − b)
                 ≤ 2^{p−1} ( ‖U_{i*}‖_p^p β^p 9^p Z^p / q_i + |ρ̂_i|^p / q_i )          (by Equation (4.13))
                 ≤ 2^{p−1} ( α^p β^p 9^p Z^p + 8^p Z^p ) / r_2                           (by Equation (4.14))          (4.15)
                 ≤ c_p (αβ)^p Z^p / r_2 ,                                                 (4.16)

where we set c_p = 2^{p−1}(9^p + 8^p). Thus, we may define ∆ = c_p (αβ)^p Z^p / r_2. In addition, it follows that

    Σ_{i: q_i < 1} E[X_i^2] = Σ_{i: q_i < 1} ( |A_{i*} x_opt − b_i|^p / q_i ) |A_{i*} x_opt − b_i|^p
                            ≤ ∆ Σ_i |A_{i*} x_opt − b_i|^p                               (by Equation (4.16))
                            ≤ c_p (αβ)^p Z^{2p} / r_2 .                                   (4.17)

To apply the upper tail bound of Equation (4.3) of Theorem 4.1, define γ = ((1 + ε)^p − 1) Z^p. We have γ ≥ pε Z^p, and since ε ≤ 1/7, we also have γ ≤ ((8/7)^p − 1) Z^p. Hence, by Equation (4.3) of Theorem 4.1, it follows that

    ln Pr[ ‖T(A x_opt − b)‖^p > ‖A x_opt − b‖^p + γ ] ≤ −γ^2 / ( 2 Σ_{i: q_i < 1} σ_i^2 + 2γ∆/3 ) ≤ −p^2 ε^2 r_2 / ( 36^p (αβ)^p ) .

Thus, Pr[ ‖T(A x_opt − b)‖ > (1 + ε)Z ] ≤ exp( −p^2 ε^2 r_2 / (36^p (αβ)^p) ), from which the lemma follows by our choice of r_2.

Next we show that if the solution to the sampled problem provides a relative-error approximation (when evaluated in the sampled problem), then when this solution is evaluated in the original regression problem we get a (slightly weaker) relative-error approximation. We first establish two technical lemmas.

The following lemma says that for all optimal solutions x̂_opt to the second-stage sampled problem, A x̂_opt is not too far from A x̂_c, where x̂_c is the optimal solution from the first stage, in a p-norm sense. Hence, the lemma will allow us to restrict our calculations in Lemmas 4.9 and 4.10 to the ball of radius 12Z centered at A x̂_c.

Lemma 4.8. ‖A x̂_opt − A x̂_c‖ ≤ 12Z.

Proof. By two applications of the triangle inequality, it follows that

    ‖A x̂_opt − A x̂_c‖ ≤ ‖A x̂_opt − A x_opt‖ + ‖A x_opt − b‖ + ‖A x̂_c − b‖ ≤ ‖A x̂_opt − A x_opt‖ + 9Z ,

where the second inequality follows since ‖A x̂_c − b‖ ≤ 8Z from the first stage of sampling and since Z = ‖A x_opt − b‖. In addition, we have that

    ‖A x_opt − A x̂_opt‖ ≤ (1/(1 − ε)) ‖T(A x̂_opt − A x_opt)‖                           (by Theorem 4.3)
                        ≤ (1/(1 − ε)) ( ‖T(A x̂_opt − b)‖ + ‖T(A x_opt − b)‖ )
                        ≤ (2/(1 − ε)) ‖T(A x_opt − b)‖
                        ≤ 2 ((1 + ε)/(1 − ε)) ‖A x_opt − b‖                              (by Lemma 4.7) ,

where the third inequality follows since x̂_opt is optimal for the sampled problem. The lemma follows since ε ≤ 1/7.

Thus, if we define the affine ball of radius 12Z that is centered at A x̂_c and that lies in span(A),

    B = { y ∈ R^n : y = Ax, x ∈ R^m, ‖A x̂_c − y‖ ≤ 12Z } ,                               (4.18)

then Lemma 4.8 states that A x̂_opt ∈ B for all optimal solutions x̂_opt to the sampled problem. Let us consider an (εZ)-net, call it B_ε, for this ball B. Using standard arguments, the size of the net is (3 · 12Z / (εZ))^d = (36/ε)^d. The next lemma states that for all points in the net, if that point provides a relative-error approximation (when evaluated in the sampled problem), then when this point is evaluated in the original regression problem we get a (slightly weaker) relative-error approximation.

Lemma 4.9. For all points A x_ε in the net B_ε: if ‖T(A x_ε − b)‖ ≤ (1 + 3ε)Z, then ‖A x_ε − b‖ ≤ (1 + 6ε)Z, with probability 0.99.

Proof. Fix a given point y*_ε = A x*_ε ∈ B_ε. We will prove the contrapositive for this point; i.e., we will prove that if ‖A x*_ε − b‖ > (1 + 6ε)Z, then ‖T(A x*_ε − b)‖ > (1 + 3ε)Z, with probability at least 1 − (1/100)(ε/36)^d. The lemma will then follow from the union bound.

To this end, define the random variable X_i = (T_ii |A_{i*} x*_ε − b_i|)^p, and recall that A_{i*} = U_{i*} τ since A = Uτ. Clearly, Σ_{i=1}^n X_i = ‖T(A x*_ε − b)‖^p. In addition, since E[X_i] = |A_{i*} x*_ε − b_i|^p, it follows that Σ_{i=1}^n E[X_i] = ‖A x*_ε − b‖^p. We will use Equation (4.2) of Theorem 4.1 to provide an upper bound for the event that ‖T(A x*_ε − b)‖^p ≤ ‖A x*_ε − b‖^p − γ, where γ = ‖A x*_ε − b‖^p − (1 + 3ε)^p Z^p, under the assumption that ‖A x*_ε − b‖ > (1 + 6ε)Z.

From the definition of q_i in Equation (4.12), it follows that for some of the rows, q_i may equal 1 (just as in the proof of Theorem 4.3). Since X_i = E[X_i] for these rows, Σ_i (X_i − E[X_i]) = Σ_{i: q_i < 1} (X_i − E[X_i]), and thus we will bound this latter quantity with Equation (4.2). To do so, we must first provide a bound for Σ_{i: q_i < 1} E[X_i^2]. To that end, note that

    |A_{i*}(x*_ε − x̂_c)| ≤ ‖U_{i*}‖_p ‖τ(x*_ε − x̂_c)‖_q           (by Hölder's inequality)
                         ≤ ‖U_{i*}‖_p β ‖Uτ(x*_ε − x̂_c)‖_p        (by Definition 4.2 and Theorem 4.2)
                         ≤ ‖U_{i*}‖_p β · 12Z ,                      (4.19)

where the final inequality follows from the radius of the high-dimensional ball in which the net resides. From this, we can show that

    |A_{i*} x*_ε − b_i|^p / q_i ≤ 2^{p−1} ( |A_{i*} x*_ε − A_{i*} x̂_c|^p + |ρ̂_i|^p ) / q_i      (since ρ̂ = A x̂_c − b)
                                ≤ 2^{p−1} ( ‖U_{i*}‖_p^p 12^p β^p Z^p / q_i + |ρ̂_i|^p / q_i )     (by Equation (4.19))
                                ≤ 2^{p−1} ( α^p 12^p β^p Z^p + 8^p Z^p ) / r_2                      (by Equation (4.14))
                                ≤ 24^p (αβ)^p Z^p / r_2 .                                            (4.20)

Therefore, we have that

    Σ_{i: q_i < 1} E[X_i^2] = Σ_{i: q_i < 1} ( |A_{i*} x*_ε − b_i|^p / q_i ) |A_{i*} x*_ε − b_i|^p
                            ≤ ( 24^p (αβ)^p Z^p / r_2 ) Σ_i |A_{i*} x*_ε − b_i|^p        (by Equation (4.20))
                            ≤ 24^p (αβ)^p ‖A x*_ε − b‖^{2p} / r_2 ,                       (4.21)

where the last step uses Z ≤ ‖A x*_ε − b‖. In order to apply the lower tail bound of Equation (4.2) of Theorem 4.1, define γ = ‖A x*_ε − b‖^p − (1 + 3ε)^p Z^p. Thus, by Equation (4.21) and by Equation (4.2) of Theorem 4.1, it follows that

    ln Pr[ ‖T(A x*_ε − b)‖^p ≤ (1 + 3ε)^p Z^p ] ≤ −r_2 ( ‖A x*_ε − b‖^p − (1 + 3ε)^p Z^p )^2 / ( 24^p (αβ)^p ‖A x*_ε − b‖^{2p} )
        ≤ −( r_2 / (24^p (αβ)^p) ) ( 1 − (1 + 3ε)^p Z^p / ‖A x*_ε − b‖^p )^2
        < −( r_2 / (24^p (αβ)^p) ) ( 1 − (1 + 3ε)^p Z^p / ((1 + 6ε)^p Z^p) )^2           (by the premise)
        ≤ −r_2 ε^2 / ( 24^p (αβ)^p )                                                      (since ε ≤ 1/3) .

Since r_2 ≥ 24^p (αβ)^p (d ln(36/ε) + ln(200)) / ε^2, it follows that ‖T(A x*_ε − b)‖ ≤ (1 + 3ε)Z with probability no greater than (1/200)(ε/36)^d. Since there are no more than (36/ε)^d such points in the net, the lemma follows by the union bound.

Finally, the next lemma states that if the solution to the sampled problem (in the second stage of sampling) provides a relative-error approximation (when evaluated in the sampled problem), then when this solution is evaluated in the original regression problem we get a (slightly weaker) relative-error approximation. This is the analog of Lemma 4.6, and its proof will use Lemma 4.9.

Lemma 4.10. If ‖T(A x̂_opt − b)‖ ≤ (1 + ε)Z, then ‖A x̂_opt − b‖ ≤ (1 + 7ε)Z.

Proof. We will prove the contrapositive; that is, if ‖A x̂_opt − b‖ > (1 + 7ε)Z, then ‖T(A x̂_opt − b)‖ > (1 + ε)Z. Since A x̂_opt lies in the ball B defined by Equation (4.18) and since the net is constructed in this ball, there exists a point y_ε = A x_ε, call it A x*_ε, such that ‖A x̂_opt − A x*_ε‖ ≤ εZ. Thus,

    ‖A x*_ε − b‖ ≥ ‖A x̂_opt − b‖ − ‖A x*_ε − A x̂_opt‖              (by the triangle inequality)
                 > (1 + 7ε)Z − εZ = (1 + 6ε)Z                          (by assumption and the definition of A x*_ε) .

Next, since Lemma 4.9 holds for all points A x_ε in the net, it follows that

    ‖T(A x*_ε − b)‖ > (1 + 3ε)Z .                                       (4.22)

Finally, note that

    ‖T(A x̂_opt − b)‖ ≥ ‖T(A x*_ε − b)‖ − ‖TA(x*_ε − x̂_opt)‖        (by the triangle inequality)
                     > (1 + 3ε)Z − (1 + ε) ‖A(x*_ε − x̂_opt)‖         (by Eq. (4.22) and Thm. 4.3)
                     > (1 + 3ε)Z − (1 + ε) εZ                           (by the definition of A x*_ε)
                     > (1 + ε)Z ,

which establishes the lemma.

Clearly, ‖T(A x̂_opt − b)‖ ≤ ‖T(A x_opt − b)‖, since x̂_opt is an optimum for the sampled ℓp regression problem. Combining this with Lemmas 4.7 and 4.10, it follows that the solution x̂_opt to the sampled problem based on the q_i's of Equation (4.12) satisfies ‖A x̂_opt − b‖ ≤ (1 + 7ε)Z; i.e., x̂_opt is a (1 + 7ε)-approximation to the original error Z, and a (1 + ε)-approximation follows after rescaling ε by a constant factor.

To conclude the proof of the claims for the second stage of sampling, recall that the first stage failed with probability no greater than 2/5. Note also that by our choice of r_2, Theorem 4.3 fails to hold for our second-stage sampling with probability no greater than 1/100. In addition, Lemma 4.7 and Lemma 4.9 each fails to hold with probability no greater than 1/100. Finally, let r̂_2 be a random variable representing the number of rows actually chosen by our sampling schema in the second stage, and note that E[r̂_2] ≤ 2 r_2. By Markov's inequality, it follows that r̂_2 > 40 r_2 with probability less than 1/20. Thus, the second stage of our algorithm fails with probability less than 1/20 + 1/100 + 1/100 + 1/100 < 1/10. By combining both stages, our algorithm fails to give a (1 + ε)-approximation in the specified running time with a probability bounded from above by 2/5 + 1/10 = 1/2.

4.6  Extensions

In this section we outline several immediate extensions of our main algorithmic result.

Constrained ℓp regression. Our sampling strategies are transparent to constraints placed on x. In particular, suppose we constrain the output of our algorithm to lie within a convex set C ⊆ R^m. If there is an algorithm to solve the constrained ℓp regression problem min_{x ∈ C} ‖A′x − b′‖, where A′ ∈ R^{s×m} is of rank d and b′ ∈ R^s, in time φ(s, m), then by modifying our main algorithm in a straightforward manner, we can obtain an algorithm that gives a (1 + ε)-approximation to the constrained ℓp regression problem in time O(nmd + nd^5 log n + φ(40 r_2, m)).

Generalized ℓp regression. Our sampling strategies extend to the case of generalized ℓp regression: given as input a matrix A ∈ R^{n×m} of rank d, a target matrix B ∈ R^{n×k}, and a real number p ∈ [1, ∞), find a matrix X ∈ R^{m×k} such that |||AX − B|||_p is minimized. To do so, we generalize our sampling strategies in a straightforward manner. The probabilities p_i for the first stage of sampling are the same as before. Then, if X̂_c is the solution to the first-stage sampled problem, we can define the n × k matrix ρ̂ = A X̂_c − B, and define the second-stage sampling probabilities to be q_i = min{ 1 , max{ p_i , r_2 ‖ρ̂_{i*}‖_p^p / |||ρ̂|||_p^p } }. Then, we can show that the X̂_opt computed from the second-stage sampled problem satisfies |||A X̂_opt − B|||_p ≤ (1 + ε) min_{X ∈ R^{m×k}} |||AX − B|||_p, with probability at least 1/2.

Weighted ℓp regression. Our sampling strategies also generalize to the case of ℓp regression involving weighted p-norms: if w_1, ..., w_n is a set of non-negative weights, then the weighted p-norm of a vector y ∈ R^n may be defined as ‖y‖_{p,w} = (Σ_{i=1}^n w_i |y_i|^p)^{1/p}, and the weighted analog of the matrix p-norm |||·|||_p may be defined as |||U|||_{p,w} = (Σ_{j=1}^d ‖U_{*j}‖_{p,w}^p)^{1/p}. Our sampling schema proceeds as before. First, we compute a "well-conditioned" basis U for span(A) with respect to this weighted p-norm. The sampling probabilities p_i for the first stage of the algorithm are then p_i = min{ 1 , r_1 w_i ‖U_{i*}‖_p^p / |||U|||_{p,w}^p }, and the sampling probabilities q_i for the second stage are q_i = min{ 1 , max{ p_i , r_2 w_i |ρ̂_i|^p / ‖ρ̂‖_{p,w}^p } }, where ρ̂ is the residual from the first stage.
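As an illustration of the weighted variant (a sketch only, assuming numpy; the basis U, the first-stage residual rho, the weights w, and the parameters r1 and r2 are taken as given, and U is assumed to be well-conditioned with respect to the weighted norm):

    import numpy as np

    def weighted_stage_probs(U, rho, w, p, r1, r2):
        """Weighted analogs of the probabilities in Equations (4.11) and (4.12)."""
        row_term = w * np.sum(np.abs(U) ** p, axis=1)        # w_i * ||U_{i*}||_p^p
        p_i = np.minimum(1.0, r1 * row_term / row_term.sum())
        res_term = w * np.abs(rho) ** p                       # w_i * |rho_i|^p
        q_i = np.minimum(1.0, np.maximum(p_i, r2 * res_term / res_term.sum()))
        return p_i, q_i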

General sampling probabilities. More generally, consider any sampling probabilities of the form

    p_i ≥ min{ 1 , max{ ‖U_{i*}‖_p^p / |||U|||_p^p , |(ρ_opt)_i|^p / Z^p } · r } ,

where ρ_opt = A x_opt − b, r ≥ (36^p d^k / ε^2)(d ln(36/ε) + ln(200)), and where we adopt the convention that 0/0 = 0. Then, by an analysis similar to that presented for our two-stage algorithm, we can show that, by picking O(36^p d^{p+1}/ε^2) rows of A and the corresponding elements of b (in a single stage of sampling) according to these probabilities, the solution x̂_opt to the sampled ℓp regression problem is a (1 + ε)-approximation to the original problem, with probability at least 1/2. (Note that these sampling probabilities, if an equality is used in this expression, depend on the entries of the vector ρ_opt = A x_opt − b; in particular, they require the solution of the original problem. This is reminiscent of the results of [34]. Our main two-stage algorithm shows that by solving a problem in the first stage based on coarse probabilities, we can refine our probabilities to approximate these probabilities and thus obtain a (1 + ε)-approximation to the ℓp regression problem more efficiently.)

Chapter 5

Conclusions and Open Problems

In this dissertation we explored two regression problems on large data that are motivated by and are useful for a variety of applications. In treating both problems, the aim was to design algorithms for summarizing the given data so that it became more useful and manageable. In the first problem, a signal arrives as a stream of input which we wish to represent as accurately as possible using a linear combination of a small number of wavelet basis vectors. In the second, we are given a large over-constrained system of equations from which we wish to extract a small coreset of constraints such that the solution to this subset yields a good solution to the whole system. While these problems have been studied before, my dissertation presents new techniques for tackling these problems (and variants thereof) under general ℓp norms, which make them non-trivial to solve.

Future work includes both theoretical and empirical questions. For example, I plan to investigate the usefulness of these newly developed sparse wavelet representation algorithms for classifying images. A complicating problem is that outliers in a signal can degrade its concise representation. Is it possible, then, to represent signals sparsely using a redundant dictionary of wavelet and Dirac bases in a provably-approximate manner? Feature selection for classification using support-vector machines and regularized least-squares regression constitutes one practical application of the coreset extraction algorithm. Another application to pursue is collaborative filtering. These applications lead, for example, to the theoretical question of whether we can relax the exact rank requirement of the dictionary matrix A and still obtain coresets with strong guarantees.

The following sections summarize our contributions, and elaborate on the aforementioned open questions.

5.1  Summary of Contributions

• We presented two approximation algorithms for Problem 1.1 (Chapters 2 and 3). The first is a greedy O(log n)-approximation algorithm that runs in linear time, uses logarithmic space, and performs only one pass over the target vector. Its approximation guarantee is also a guarantee on the gap between the restricted and unrestricted versions of the problem since the algorithm limits its choices to wavelet coefficients of the data. The second algorithm is a dynamic programming FPTAS specific to the Haar wavelet basis. This algorithm is also "one-pass" and uses sub-linear space under ℓp norms when p > 1. Specifically for the ℓ∞ case, it runs in near-linear time and poly-logarithmic space. The algorithm, with appropriate modifications, also yields a QPTAS for other compactly-supported wavelets.

• The principal technique used to design and establish the guarantees of these algorithms is a lower-bounding technique based on the wavelet coefficients and on the scaling coefficients of the target vector, where the latter coefficients are associated with the multiresolution scaling function of the wavelet. Essentially, the lower bounds indicate the ranges within which the optimal solution's expansion and scaling coefficients lie. These bounds guided our search for a solution that approximates the optimum.

• We showed how our algorithms extend to other related problems; namely, the bit-budget version of the problem, and the best-basis selection problem when the dictionary is a tree-structured dictionary of wavelet bases.

• We gave a randomized sampling-based algorithm that solves Problem 1.2 (Chapter 4). Consequently, we demonstrated the existence of coresets for the (weighted) ℓp over-constrained regression problem for all p ∈ [1, ∞), even when the solution is restricted to a convex space. The sizes of the coresets we find are independent of the original number of constraints n (the high dimension), and are on the order of O(d^{max{p/2 + 1, p} + 1} / ε^2), where d is the number of variables (the low dimension). Our algorithm ran in two stages where the results of the first stage "fed into" the second, which in turn resulted in strong relative-error approximation guarantees.

• The main concepts used in designing our sampling probabilities and proving our result are subspace-preserving sampling and a notion of p-well-conditioned bases. We showed that these bases exist, that they can be computed in time O(nd^5 log n), and that sampling constraints based on the rows of these bases minimizes the sampling variance and preserves the subspace of the given matrix A. Our proofs also made heavy use of ε-net arguments in both stages of the algorithm.

5.2  Future Research

• One immediate application to investigate is to utilize our sparse wavelet representation algorithms for image categorization and search. In text classification, documents are usually represented as sparse feature vectors in high-dimensional space. Images, however, do not lend themselves easily to sparse representations since the inter-pixel relationships are very important for correctly understanding the image. The algorithms presented in my dissertation for the nonlinear wavelet regression problem provide a way to represent images sparsely in a meaningful, provably-approximate manner. These wavelet representations may then be combined with well-known data analysis methods, such as Regularized Least Squares Classification (RLSC) and Support Vector Machine (SVM) classification, in order to classify the images, which is an important step for better image search. Engineering decisions include the choice of wavelet, the sparsity bound, and the ℓp norm used for measuring the representation error.

• There are further theoretical questions to pursue in this direction. Can we make the Haar PTAS use sub-linear space under the ℓ1 norm; i.e., can we make it stream under this error measure? It seems that a stronger lower bound is needed in order to accomplish this. Moreover, would it be possible to approximate the sparse wavelet regression problem under the streaming update model, where the signal is defined implicitly via a stream of updates? Even when allowed to store the whole signal, can we show a PTAS for other compactly-supported wavelets, especially Daubechies wavelets? What if the dictionary is a mixture of a wavelet basis and the standard Dirac basis? This allows the removal of outliers using the Dirac basis, and the better approximation of the rest of the signal using the wavelet basis. A direct application of our results gives an additive approximation, but we would like to obtain relative-error guarantees. To this end, can we use sampling, for example, to choose roughly B dictionary elements and obtain at least a constant-factor approximation? These questions further explore the field of nonlinear approximation through the lens of approximation algorithms.

• A natural avenue to pursue is to investigate whether we can apply and extend our sampling algorithm for linear regression to other problems that operate on large data. For example, in recent work [18] we employed our algorithm as an unsupervised feature selection method for kernel-based text classification.


Motivated by recent work in applied data analysis that has a strongly geometric flavor (for example, the aforementioned RLSC and SVM classification, and the Lasso shrinkage and selection method for linear regression and classification), we view feature selection as a problem in dimensionality reduction. But rather than resorting to the Singular Value Decomposition (which would produce a small number of axes that are linear combinations of all of the features), we use our sampling algorithm to choose a small number of features that preserve the relevant geometric structure in the data (at least insofar as the particular classification algorithm is concerned). Under the RLSC algorithm, we provide worst-case guarantees on the classification function that results from the features selected using our method with respect to the classification function obtained when considering all the features. However, our sampling techniques do not only apply to ℓ2 regression. By considering the p = 1 case we may be able to show similar results for SVMs, which would be of interest due to the ubiquity of SVM-based classifiers in large-scale text analysis.

• Since our sampling algorithm for linear regression runs in two stages, it lends itself well to collaborative filtering applications. Suppose we have a small number of users, possibly paid guinea pigs, that have rated a large number of products. Call this product-user matrix A. Now we get a new user b whose preferences are private, and we are allowed to query the user on a small set of products from which we will predict the rest of his or her ratings using linear regression. The set of products we choose to query can be based on our algorithm's first-stage sampling probabilities since they are computed based on A only and they preserve the rank of A. Hence, we can query the user on the sampled products and produce a weight vector x̂ by solving the resultant regression problem. The sampling algorithm ensures that the error of the representation A x̂ is a constant factor away from the error of the best possible representation of user b's preferences as a linear combination of the


preferences of the guinea pigs. Preliminary results are encouraging and I plan to further investigate how such a system performs. It may be possible to use these techniques to forecast the growth of social networks by predicting, for example, the groups a new user will join. These problems are challenging, but the work in my dissertation sheds a new light on them that may prove beneficial. • Related theoretical questions include relaxing the exact rank requirement of the dictionary matrix A, and establishing a lower bound on the number of samples that any row-sampling algorithm would need in order to find an `p regression coreset. To this end, a natural goal is to obtain a sampling lower bound that depends on both d and p; nevertheless, even an ω(d) lower bound for 1 ≤ p ≤ 2 would be interesting.


Bibliography [1] A. Abdulle and G. Wanner. 200 years of least squares method. Elemente der Mathematik, 57(2):45–60, 2002. [2] P. K. Agarwal, S. Har-Peled, and K. R. Varadarajan. Geometric approximation via coresets. In Jacob E. Goodman, Janos Pach, and Emo Welzl, editors, Combinatorial and Computational Geometry, volume 52, pages 1–30. Cambridge University Press, 2005. [3] Pankaj K. Agarwal, Sariel Har-Peled, and Kasturi R. Varadarajan. Approximating extent measures of points. J. ACM, 51(4):606–635, 2004. [4] Nir Ailon and Bernard Chazelle. Approximate nearest neighbors and the fast Johnson–Lindenstrauss transform. In Proceedings of the 38th Annual ACM Symposium on Theory of Computing, pages 557–563, 2006. [5] Herman Auerbach. On the Area of Convex Curves with Conjugate Diameters (in Polish). PhD thesis, University of Lw´ow, 1930. [6] Baruch Awerbuch and Robert D. Kleinberg. Adaptive routing with end-to-end feedback: Distributed learning and geometric approaches. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 45–53, 2004. [7] Mihai B¨adoiu and Kenneth L. Clarkson. Smaller core-sets for balls. In SODA ’03: Proceedings of the fourteenth annual ACM-SIAM symposium on Discrete 116

algorithms, pages 801–802, Philadelphia, PA, USA, 2003. Society for Industrial and Applied Mathematics. [8] S. Bernstein. Theory of Probability. Moscow, 1927. [9] J. Bourgain, J. Lindenstrauss, and V.D. Milman. Approximation of zonoids by zonotopes. Acta Math., 162:73–141, 1989. [10] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. [11] K. Chakrabarti, E. J. Keogh, S. Mehrotra, and M. J. Pazzani. Locally adaptive dimensionality reduction for indexing large time series databases. ACM TODS, 27(2):188–228, 2002. [12] Kaushik Chakrabarti, Minos Garofalakis, Rajeev Rastogi, and Kyuseok Shim. Approximate query processing using wavelets. The VLDB Journal, 10(2-3):199– 223, 2001. [13] Samprit Chatterjee, Ali S. Hadi, and Bertram Price. Regression Analysis by Example. Wiley Series in Probability and Statistics, 2000. [14] C. Christopoulos, A. Skodras, and T. Ebrahimi. The JPEG2000 still image coding system: an overview. Consumer Electronics, IEEE Transactions on, 46(4):1103–1127, 2000. [15] Kenneth L. Clarkson. Subgradient and sampling algorithms for `1 regression. In Proceedings of the 16th Annual ACM-SIAM Symposium on Discrete Algorithms, pages 257–266, 2005. [16] A. Cohen, I. Daubechies, O. Guleryuz, and M. Orchard. On the importance of combining wavelet-based non-linear approximation in coding strategies. IEEE Transactions on Information Theory, 48(7):1895–1921, 2002. 117

[17] Ronald R. Coifman and M. Victor Wickerhauser. Entropy-based algorithms for best basis selection. IEEE Transactions on Information Theory, 38(2):713–718, 1992. [18] Anirban Dasgupta, Petros Drineas, Boulos Harb, Vanja Josifovski, and Michael W. Mahoney.

Feature selection methods for text classification.

Manuscript submitted for review, 2007. [19] Anirban Dasgupta, Petros Drineas, Boulos Harb, Ravi Kumar, and Michael W. Mahoney. Sampling algorithms and coresets for `p regression. Manuscript submitted for review, 2007. [20] Ingrid Daubechies. Orthonormal bases of compactly supported wavelets. Comm. Pure. Appl. Math., 41:909–996, 1988. [21] Ingrid Daubechies. Ten lectures on wavelets. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, 1992. [22] Ingrid Daubechies. Orthonormal Bases of Compactly Supported Wavelets II. Variations on a Theme. SIAM Journal on Mathematical Analysis, 24(2):499– 519, 1993. [23] G. Davis, S. Mallat, and M. Avellaneda. Adaptive greedy approximation. Journal of Constructive Approximation, 13:57–98, 1997. [24] Malon Day. Polygons circumscribed about closed convex curves. Transactions of the American Mathematical Society, 62:315–319, 1947. [25] R. DeVore. Nonlinear approximation. Acta Numerica, pages 1–99, 1998. [26] R. DeVore, B. Jawerth, and V. A. Popov. Compression of wavelet decompositions. Amer. J. Math., 114:737–785, 1992. 118

[27] R. A. DeVore, S. V. Konyagin, and V. N. Temlyakov. Hyperbolic wavelet approximation. Constructive Approximation, 14:1–26, 1998. [28] D. Donoho and I. Johnstone. Ideal space adaptation via wavelet shrinkage. Biometrika, 81:425–455, 1994. [29] D. L. Donoho, I. M. Johnstone, G. Kerkyacharian, and D. Picard. Wavelet shrinkage: Asymptopia? J. Royal Statistical Soc., Ser. B, 57:301–369, 1996. [30] P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo algorithms for matrices I: Approximating matrix multiplication. SIAM Journal on Computing, 36:132–157, 2006. [31] P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo algorithms for matrices II: Computing a low-rank approximation to a matrix. SIAM Journal on Computing, 36:158–183, 2006. [32] P. Drineas, R. Kannan, and M.W. Mahoney. Fast Monte Carlo algorithms for matrices III: Computing a compressed approximate matrix decomposition. SIAM Journal on Computing, 36:184–206, 2006. [33] P. Drineas, M. W. Mahoney, and S. Muthukrishnan. Relative-error low-rank matrix approximations: Subspace sampling and approximate `2 regression for improved data analysis methods. Manuscript submitted for review, 2007. [34] Petros Drineas, Michael W. Mahoney, and S. Muthukrishnan. Sampling algorithms for `2 regression and applications. In Proceedings of the 17th Annual ACM-SIAM Symposium on Discrete Algorithm, pages 1127–1136, 2006. [35] Uriel Feige. A threshold of ln n for approximating set cover. J. ACM, 45(4):634– 652, 1998. [36] Dan Feldman, Amos Fiat, and Micha Sharir. Coresets for weighted facilities and their applications. In Proceedings of the 47th Annual IEEE Symposium on 119

Foundations of Computer Science (FOCS’06), pages 315–324. IEEE Computer Society, 2006. [37] A. Frieze, R. Kannan, and S. Vempala. Fast Monte-Carlo algorithms for finding low-rank approximations. In Proceedings of the 39th Annual IEEE Symposium on Foundations of Computer Science, pages 370–378, 1998. [38] M. Garofalakis and A. Kumar. Deterministic wavelet thresholding for maximum error metric. Proc. of PODS, 2004. [39] M. N. Garofalakis and P. B. Gibbons. Probabilistic wavelet synopses. ACM TODS, 29:43–90, 2004. [40] Minos Garofalakis and Amit Kumar. Wavelet synopses for general error metrics. ACM Trans. Database Syst., 30(4):888–928, 2005. [41] A. C. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and Martin Strauss. Fast, small-space algorithms for approximate histogram maintenance. In Proc. of ACM STOC, 2002. [42] A. C. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. Optimal and approximate computation of summary statistics for range aggregates. In Proc. of ACM PODS, 2001. [43] Anna C. Gilbert, S. Muthukrishnan, and Martin Strauss. Approximation of functions over redundant dictionaries using coherence. Proc. of SODA, pages 243–252, 2003. [44] Anna C. Gilbert, S. Muthukrishnan, and Martin Strauss. Approximation of functions over redundant dictionaries using coherence. Proc. of SODA, pages 243–252, 2003. [45] G.H. Golub and C.F. Van Loan. Matrix Computations. Johns Hopkins University Press, 1989. 120

[46] S. Guha. Space efficiency in synopsis construction problems. Proc. of VLDB Conference, 2005. [47] S. Guha, P. Indyk, S. Muthukrishnan, and M. Strauss. Histogramming data streams with fast per-item processing. In Proc. of ICALP, 2002. [48] S. Guha, C. Kim, and K. Shim. XWAVE: Optimal and approximate extended wavelets for streaming data. Proceedings of VLDB Conference, 2004. [49] S. Guha, N. Koudas, and K. Shim. Data Streams and Histograms. In Proc. of STOC, 2001. [50] S. Guha, N. Mishra, R. Motwani, and L. O’Callaghan. Clustering data streams. Proceedings of the Symposium on Foundations of Computer Science (FOCS), pages 359–366, 2000. [51] Sudipto Guha and Boulos Harb. Wavelet synopsis for data streams: minimizing non-euclidean error. In KDD ’05: Proceeding of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining, pages 88–97, New York, NY, USA, 2005. ACM Press. [52] Sudipto Guha and Boulos Harb. Approximation algorithms for wavelet transform coding of data streams. In SODA ’06: Proceedings of the seventeenth annual ACM-SIAM symposium on Discrete algorithm, pages 698–707, New York, NY, USA, 2006. ACM Press. [53] A. Haar. Zur theorie der orthogonalen funktionen-systeme. Math. Ann., 69:331– 371, 1910. [54] Sariel Har-Peled and Soham Mazumdar. On coresets for k-means and k-median clustering. In STOC ’04: Proceedings of the thirty-sixth annual ACM symposium on Theory of computing, pages 291–300, New York, NY, USA, 2004. ACM Press. 121

[55] T. Hastie, R. Tibshirani, and J.H. Friedman. The Elements of Statistical Learning. Springer, 2003. [56] Y. E. Ioannidis. The history of histograms (abridged). Proc. of VLDB Conference, pages 19–30, 2003. [57] C. E. Jacobs, A. Finkelstein, and D. H. Salesin. Fast multiresolution image querying. Computer Graphics, 29(Annual Conference Series):277–286, 1995. [58] F. John. Extremum problems with inequalities as subsidiary conditions. In K. O. Friedrichs, O. E. Neugebauer, and J. J. Stoker, editors, Studies and Essays Presented to R. Courant on his 60th Birthday, pages 187–204. Interscience, New York, Jan. 1948. [59] E. Keogh, K. Chakrabati, S. Mehrotra, and M. Pazzani. Locally Adaptive Dimensionality Reduction for Indexing Large Time Series Databases. Proc. of ACM SIGMOD, Santa Barbara, March 2001. [60] Jon Kleinberg and Mark Sandler. Using mixture models for collaborative filtering. In Proceedings of the 36th Annual ACM Symposium on Theory of Computing, pages 569–578, 2004. [61] P. G. Lemari´e, editor. Les ondelettes en 1989. Number 1438 in Lecture notes in Mathematics. Spring-Verlag, Berlin, 1990. [62] DR Lewis. Ellipsoids defined by Banach ideal norms. Mathematika, 26(1):18–29, 1979. [63] S. Mallat. Multiresolution approximations and wavelet orthonormal bases of l2 (R). Trans. Amer. Math. Soc., 315:69–87, September 1989. [64] S. Mallat. A wavelet tour of signal processing. Academic Press, 1999. [65] Y. Matias and D. Urieli. Personal communication, 2004. 122

[66] Y. Matias and D. Urieli. Optimal workload-based weighted wavelet synopses. Proc. of ICDT, pages 368–382, 2005. [67] Y. Matias, J. Scott Vitter, and M. Wang. Wavelet-Based Histograms for Selectivity Estimation. Proc. of ACM SIGMOD, 1998. [68] A. Maurer. A bound on the deviation probability for sums of non-negative random variables. Journal of Inequalities in Pure and Applied Mathematics, 4(1), 2003. [69] H.C. McCullough. Kokin Wakashu: The First Imperial Anthology of Japanese Poetry. Stanford University Press, Palo Alto, 1984. Translated from Japanese. [70] Y. Meyer. Wavelets and operators. Advanced mathematics. Cambridge University Press, 1992. [71] S. Muthukrishnan. Nonuniform sparse approximation using haar wavelet basis. DIMACS TR 2004-42, 2004. [72] S. Muthukrishnan. Data streams: algorithms and applications. Foundations and Trends in Theoretical Computer Science, 1(2):117–236, 2005. [73] Apostol Natsev, Rajeev Rastogi, and Kyuseok Shim. Walrus: a similarity retrieval algorithm for image databases. In SIGMOD ’99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 395–406, New York, NY, USA, 1999. ACM Press. [74] L. Rademacher, S. Vempala, and G. Wang. Matrix approximation and projective clustering via iterative sampling. Technical Report MIT-LCS-TR-983, Massachusetts Institute of Technology, Cambridge, MA, March 2005. [75] Santiago Gonzalez Sanchez, Nuria Gonzalez Prelcic, and Sergio J. Garca Galan. Uvi Wave version 3.0—Wavelet toolbox for use with matlab. 123

[76] Tam´as Sarl´os. Improved approximation algorithms for large matrices via random projections. In Proceedings of the 47th Annual IEEE Symposium on Foundations of Computer Science (FOCS’06), pages 143–152. IEEE Computer Society, 2006. [77] G. Schechtman. More on embedding subspaces of Lp in `nr . Compositio Math, 61(2):159–169, 1987. [78] Gideon Schechtman and Artem Zvavitch. Embedding subspaces of Lp into `N p , 0 < p < 1. Math. Nachr., 227:133–142, 2001. [79] E. Schmidt. Zur theorie der linearen und nichtlinearen integralgleichungen - i. Math. Annalen, 63:433–476, 1907. [80] Eric J. Stollnitz, Tony D. Derose, and David H. Salesin. Wavelets for computer graphics: theory and applications. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 1996. [81] Michel Talagrand. Embedding subspaces of L1 into `N 1 . Proceedings of the American Mathematical Society, 108(2):363–369, February 1990. [82] Michel Talagrand. Embedding subspaces of Lp into `N p . Oper. Theory Adv. Appl., 77:311–325, 1995. [83] Angus Taylor. A geometric theorem and its application to biorthogonal systems. Bulletin of the American Mathematical Society, 53:614–616, 1947. [84] V. N. Temlyakov. The best m-term representation and greedy algorithms. Advances in Computational Mathematics, 8:249–265, 1998. [85] V. N. Temlyakov. Nonlinear methods of approximation. Foundations of Computational Mathematics, 3:33–107, 2003. [86] V. Vazirani. Approximation Algorithms. Springer Verlag, 2001. 124

[87] Jeffrey Scott Vitter and Min Wang. Approximate computation of multidimensional aggregates of sparse data using wavelets. In SIGMOD ’99: Proceedings of the 1999 ACM SIGMOD international conference on Management of data, pages 193–204, New York, NY, USA, 1999. ACM Press.
[88] Jeffrey Scott Vitter, Min Wang, and Bala Iyer. Data cube approximation and histograms via wavelets. In CIKM ’98: Proceedings of the seventh international conference on Information and knowledge management, pages 96–104, New York, NY, USA, 1998. ACM Press.
[89] P. Wojtaszczyk. Banach Spaces for Analysts, volume 25 of Cambridge Studies in Advanced Mathematics. Cambridge University Press, 1991.
