1 Bell Labs, Lucent Technologies [email protected]
2 Intel Research Berkeley [email protected]
3 National Technical University of Athens [email protected]

Abstract. Recent years have seen growing interest in effective algorithms for summarizing and querying massive, high-speed data streams. Randomized sketch synopses provide accurate approximations for general-purpose summaries of the streaming data distribution (e.g., wavelets). The focus of existing work has typically been on minimizing the space requirements of the maintained synopsis. However, to effectively support high-speed data-stream analysis, a crucial practical requirement is to also optimize: (1) the update time for incorporating a streaming data element in the sketch, and (2) the query time for producing an approximate summary (e.g., the top wavelet coefficients) from the sketch. Such time costs must be small enough to cope with rapid stream-arrival rates and the real-time querying requirements of typical streaming applications (e.g., ISP network monitoring). With cheap and plentiful memory, space is often only a secondary concern after query/update time costs. In this paper, we propose the first fast solution to the problem of tracking wavelet representations of one-dimensional and multi-dimensional data streams, based on a novel stream synopsis, the Group-Count Sketch (GCS). By imposing a hierarchical structure of groups over the data and applying the GCS, our algorithms can quickly recover the most important wavelet coefficients with guaranteed accuracy. Varying the hierarchical structure of groups establishes a tradeoff between query time and update time, allowing the right balance to be found for a specific data stream. Experimental analysis confirms this tradeoff, and shows that all our methods significantly outperform previously known methods in terms of both update time and query time, while maintaining a high level of accuracy.

Y. Ioannidis et al. (Eds.): EDBT 2006, LNCS 3896, pp. 4–22, 2006. © Springer-Verlag Berlin Heidelberg 2006

1 Introduction

Driven by the enormous volumes of data communicated over today’s Internet, several emerging data-management applications crucially depend on the ability to continuously generate, process, and analyze massive amounts of data in real time. A typical example domain here comprises the class of continuous event-monitoring systems deployed in a wide variety of settings, ranging from network-event tracking in large ISPs to transaction-log monitoring in large web-server farms and satellite-based environmental monitoring. For instance, tracking the operation of a nationwide ISP network

Fast Approximate Wavelet Tracking on Streams


requires monitoring detailed measurement data from thousands of network elements across several different layers of the network infrastructure. The volume of such monitoring data can easily become overwhelming (on the order of terabytes per day). To deal effectively with the massive volume and continuous, high-speed nature of data in such environments, the data streaming paradigm has proven vital. Unlike conventional database query-processing engines that require several (expensive) passes over a static, archived data image, streaming data-analysis algorithms rely on building concise, approximate (but highly accurate) synopses of the input stream(s) in real time (i.e., in one pass over the streaming data). Such synopses typically require space that is significantly sublinear in the size of the data and can be used to provide approximate query answers with guarantees on the quality of the approximation. In many monitoring scenarios, it is neither desirable nor necessary to maintain the data in full; instead, stream synopses can be used to retain enough information for the reliable reconstruction of the key features of the data required in analysis.

The collection of the top (i.e., largest) coefficients in the wavelet transform (or, decomposition) of an input data vector is one example of such a key feature of the stream. Wavelets provide a mathematical tool for the hierarchical decomposition of functions, with a long history of successful applications in signal and image processing [16, 22]. Applying the wavelet transform to a (one- or multi-dimensional) data vector and retaining a select small collection of the largest wavelet coefficients gives a very effective form of lossy data compression.
Such wavelet summaries provide concise, general-purpose summaries of relational data, and can form the foundation for fast and accurate approximate query processing algorithms (such as approximate selectivity estimates, OLAP range aggregates, and approximate join and multi-join queries). Wavelet summaries can also give accurate (one- or multi-dimensional) histograms of the underlying data vector at multiple levels of resolution, thus providing valuable primitives for effective data visualization.

Most earlier stream-summarization work focuses on minimizing the space requirements for a given level of accuracy (in the resulting approximate wavelet representation) while the data vector is being rendered as a stream of arbitrary point updates. However, while space is an important consideration, it is certainly not the only parameter of interest. To effectively support high-speed data-stream analyses, two additional key parameters of a streaming algorithm are: (1) the update time for incorporating a streaming update in the sketch, and (2) the query time for producing the approximate summary (e.g., the top wavelet coefficients) from the sketch. Minimizing query and update times is a crucial requirement to cope with rapid stream-arrival rates and the real-time querying needs of modern streaming applications. Furthermore, there are essential tradeoffs between the above three parameters (i.e., space, query time, and update time), and it can be argued that space usage is often the least important of these. For instance, consider monitoring a stream of active network connections for the users consuming the most bandwidth (commonly referred to as the “top talkers” or “heavy hitters” [6, 18]). Typical results for this problem give a stream-synopsis space requirement of $O(1/\epsilon)$, meaning that an accuracy of $\epsilon = 0.1\%$ requires only a few thousand storage locations, i.e., a few kilobytes, which is of little consequence at all in today’s off-the-shelf systems


G. Cormode, M. Garofalakis, and D. Sacharidis

featuring gigabytes of main memory.¹ Now, suppose that the network is processing IP packets, on average a few hundred bytes in length, at rates of hundreds of Mbps; essentially, this implies that the average processing time per packet must be much less than one millisecond, for an average system throughput of tens to hundreds of thousands of packets per second. Thus, while synopsis space is probably a non-issue in this setting, the times to update and query the synopsis can easily become an insurmountable bottleneck. To scale to such high data speeds, streaming algorithms must guarantee provably small time costs for updating the synopsis in real time. Small query times are also important, since many applications require near real-time response (e.g., for detecting and reacting to potential network attacks). In summary, we need fast item processing, fast analysis, and bounded space usage. Different scenarios place different emphasis on each parameter but, in general, more attention needs to be paid to the time costs of streaming algorithms.

Our Contributions. The streaming wavelet algorithms of Gilbert et al. [11] guaranteed small space usage, only polylogarithmic in the size of the vector. Unfortunately, the update- and query-time requirements of their scheme can easily become problematic for real-time monitoring applications, since the whole data structure must be “touched” for each update, and every wavelet coefficient must be queried to find the best few. Although [11] tries to reduce this cost by introducing more complex range-summable hash functions to make estimating individual wavelet coefficients faster, the number of queries does not decrease, and the additional complexity of the hash functions means that the update time increases further. Clearly, such high query times are not acceptable for any real-time monitoring environment, and pose the key obstacle in extending the algorithms in [11] to multi-dimensional data (where the domain size grows exponentially with data dimensionality).
In this paper, we propose the first known streaming algorithms for space- and time-efficient tracking of approximate wavelet summaries for both one- and multi-dimensional data streams. Our approach relies on a novel, sketch-based stream synopsis structure, termed the Group-Count Sketch (GCS), that allows us to provide space/accuracy tradeoffs similar to those of the simple sketches of [11], while guaranteeing: (1) small, logarithmic update times (essentially touching only a small fraction of the GCS for each streaming update) with simple, fast hash functions; and, (2) polylogarithmic query times for computing the top wavelet coefficients from the GCS. In brief, our GCS algorithms rely on two key, novel technical ideas. First, we work entirely in the wavelet domain, in the sense that we directly sketch wavelet coefficients, rather than the original data vector, as updates arrive. Second, our GCSs employ group structures based on hashing and hierarchical decomposition over the wavelet domain to enable fast updates and efficient binary-search-like techniques for identifying the top wavelet coefficients in sublinear time. We also demonstrate that, by varying the degree of our search procedure, we can effectively explore the tradeoff between update and query costs in our GCS synopses.

Our GCS algorithms and results also naturally extend to both the standard and non-standard form of the multi-dimensional wavelet transform, essentially providing the only known efficient solution for streaming wavelets in more than one dimension. As our experimental results with both synthetic and real-life data demonstrate, our GCS synopses allow very fast updates and searches, capable of supporting very high-speed data sources.

¹ One issue surrounding the use of very small space is whether the data structure fits into the faster cache memory, which again emphasizes the importance of running-time costs.

2 Preliminaries

In this section, we first discuss the basic elements of our stream-processing model and briefly introduce AMS sketches [2]; then, we present a short introduction to the Haar wavelet decomposition in both one and multiple dimensions, focusing on some of its key properties for our problem setting.

2.1 Stream Processing Model and Stream Sketches

Our input comprises a continuous stream of update operations, rendering a data vector a of N values (i.e., the data-domain size). Without loss of generality, we assume that the index of our data vector takes values in the integer domain [N] = {0, ..., N − 1}, where N is a power of 2 (to simplify the notation). Each streaming update is a pair of the form (i, ±v), denoting a net change of ±v in the a[i] entry; that is, the effect of the update is to set a[i] ← a[i] ± v. Intuitively, “+v” (“−v”) can be seen as v insertions (resp., deletions) of the ith vector element, but more generally we allow entries to take negative values. (Our model instantiates the most general and, hence, most demanding turnstile model of streaming computations [20].) Our model generalizes to multi-dimensional data: for d data dimensions, a is a d-dimensional vector (tensor) and each update $((i_1, \ldots, i_d), \pm v)$ effects a net change of ±v on entry $a[i_1, \ldots, i_d]$.²

In the data-streaming context, updates are only seen once in the (fixed) order of arrival; furthermore, the rapid data-arrival rates and large data-domain size N make it impossible to store a explicitly. Instead, our algorithms can only maintain a concise synopsis of the stream that requires only sublinear space, and, at the same time, can (a) be maintained in small, sublinear processing time per update, and (b) provide query answers in sublinear time. Sublinear here means polylogarithmic in N, the data-vector size. (More strongly, our techniques guarantee update times that are sublinear in the size of the synopsis.)

Randomized AMS Sketch Synopses for Streams.
The randomized AMS sketch [2] is a broadly applicable stream synopsis structure based on maintaining randomized linear projections of the streaming input data vector a. Briefly, an atomic AMS sketch of a is simply the inner product $\langle a, \xi \rangle = \sum_i a[i]\,\xi(i)$, where ξ denotes a random vector of four-wise independent ±1-valued random variates. Such variates can be easily generated on-line through standard pseudo-random hash functions ξ() using only O(log N) space (for seeding) [2, 11]. To maintain this inner product over the stream of updates to a, initialize a running counter X to 0 and set X ← X ± vξ(i) whenever the update (i, ±v) is seen in the input stream. An AMS sketch of a comprises several independent atomic AMS sketches (i.e., randomized counters), each with a different random hash function ξ(). The following theorem summarizes the key property of AMS sketches for stream-query estimation, where $\|v\|_2$ denotes the $L_2$-norm of a vector v, so $\|v\|_2 = \sqrt{\langle v, v \rangle} = \sqrt{\sum_i v[i]^2}$.

Theorem 1 ([1, 2]). Consider two (possibly streaming) data vectors a and b, and let Z denote the $O(\log(1/\delta))$-wise median of $O(1/\epsilon^2)$-wise means of independent copies of the atomic AMS sketch product $\left(\sum_i a[i]\,\xi_j(i)\right)\left(\sum_i b[i]\,\xi_j(i)\right)$. Then, $|Z - \langle a, b \rangle| \le \epsilon \|a\|_2 \|b\|_2$ with probability $\ge 1 - \delta$.

Thus, using AMS sketches comprising only $O\!\left(\frac{\log(1/\delta)}{\epsilon^2}\right)$ atomic counters we can approximate the vector inner product $\langle a, b \rangle$ to within $\pm\epsilon \|a\|_2 \|b\|_2$ (hence implying an ε-relative error estimate for $\|a\|_2^2$).

² Without loss of generality we assume a domain of $[N]^d$ for the d-dimensional case; different dimension sizes can be handled in a straightforward manner. Further, our methods do not need to know the domain size N beforehand, since standard adaptive techniques can be used.

2.2 Discrete Wavelet Transform Basics

The Discrete Wavelet Transform (DWT) is a useful mathematical tool for hierarchically decomposing functions in ways that are both efficient and theoretically sound. Broadly speaking, the wavelet decomposition of a function consists of a coarse overall approximation together with detail coefficients that influence the function at various scales [22]. Haar wavelets represent the simplest DWT basis: they are conceptually simple, easy to implement, and have proven their effectiveness as a data-summarization tool in a variety of settings [4, 24, 10].

One-Dimensional Haar Wavelets. Consider the one-dimensional data vector a = [2, 2, 0, 2, 3, 5, 4, 4] (N = 8). The Haar DWT of a is computed as follows. We first average the values together pairwise to get a new “lower-resolution” representation of the data with the pairwise averages $\left[\frac{2+2}{2}, \frac{0+2}{2}, \frac{3+5}{2}, \frac{4+4}{2}\right] = [2, 1, 4, 4]$. This averaging loses some of the information in a. To restore the original a values, we need detail coefficients that capture the missing information. In the Haar DWT, these detail coefficients are the differences of the (second of the) averaged values from the computed pairwise average.
Thus, in our simple example, for the first pair of averaged values, the detail coefficient is 0 since $\frac{2-2}{2} = 0$; for the second it is −1 since $\frac{0-2}{2} = -1$. No information is lost in this process: one can reconstruct the eight values of the original data array from the lower-resolution array containing the four averages and the four detail coefficients. We recursively apply this pairwise averaging and differencing process on the lower-resolution array of averages until we reach the overall average, to get the full Haar decomposition. The final Haar DWT of a is given by $w_a = [11/4, -5/4, 1/2, 0, 0, -1, -1, 0]$, that is, the overall average followed by the detail coefficients in order of increasing resolution. Each entry in $w_a$ is called a wavelet coefficient. The main advantage of using $w_a$ instead of the original data vector a is that for vectors containing similar values most of the detail coefficients tend to have very small values. Thus, eliminating such small coefficients from the wavelet transform (i.e., treating them as zeros) introduces only small errors when reconstructing the original data, resulting in a very effective form of lossy data compression [22].

A useful conceptual tool for visualizing and understanding the (hierarchical) Haar DWT process is the error tree structure [19] (shown in Fig. 1(a) for our example array a). Each internal tree node $c_i$ corresponds to a wavelet coefficient (with the root node $c_0$ being the overall average), and leaf nodes a[i] correspond to the original data-array entries. This view allows us to see that the reconstruction of any a[i] depends only on the $\log N + 1$ coefficients in the path between the root and a[i]; symmetrically, it means a change in a[i] only impacts its $\log N + 1$ ancestors in an easily computable way. We define the support for a coefficient $c_i$ as the contiguous range of data-array entries that $c_i$ is used to reconstruct (i.e., the range of data/leaf nodes in the subtree rooted at $c_i$). Note that the supports of all coefficients at resolution level l of the Haar DWT are exactly the $2^l$ (disjoint) dyadic ranges of size $N/2^l = 2^{\log N - l}$ over [N], defined as $R_{l,k} = [k \cdot 2^{\log N - l}, \ldots, (k+1) \cdot 2^{\log N - l} - 1]$ for $k = 0, \ldots, 2^l - 1$ (for each resolution level $l = 0, \ldots, \log N$).

Fig. 1. Example error-tree structures for (a) a one-dimensional data array (N = 8), and (b) non-standard two-dimensional Haar coefficients for a 4 × 4 data array (coefficient magnitudes are multiplied by +1 (−1) in the “+” (resp., “−”) labeled ranges, and 0 in blank areas)

The Haar DWT can also be conceptualized in terms of vector inner-product computations: let $\phi_{l,k}$ denote the vector with $\phi_{l,k}[i] = 2^{l - \log N}$ for $i \in R_{l,k}$ and 0 otherwise, for $l = 0, \ldots, \log N$ and $k = 0, \ldots, 2^l - 1$; then, each of the coefficients in the Haar DWT of a can be expressed as the inner product of a with one of the N distinct Haar wavelet basis vectors:

$$\left\{ \tfrac{1}{2}\left(\phi_{l+1,2k} - \phi_{l+1,2k+1}\right) : l = 0, \ldots, \log N - 1;\ k = 0, \ldots, 2^l - 1 \right\} \cup \{\phi_{0,0}\}$$

Intuitively, wavelet coefficients with larger support carry a higher weight in the reconstruction of the original data values. To equalize the importance of all Haar DWT coefficients, a common normalization scheme is to scale the coefficient values at level l (or, equivalently, the basis vectors $\phi_{l,k}$) by a factor of $\sqrt{N/2^l}$.
This normalization essentially turns the Haar DWT basis vectors into an orthonormal basis. Letting $c_i^*$ denote the normalized coefficient values, this fact has two important consequences: (1) The energy of the vector a is preserved in the wavelet domain, that is, $\|a\|_2^2 = \sum_i a[i]^2 = \sum_i (c_i^*)^2$ (by Parseval’s theorem); and, (2) Retaining the B largest coefficients in terms of absolute normalized value gives the (provably) best B-term approximation in terms of Sum-Squared-Error (SSE) in the data reconstruction (for a given budget of coefficients B) [22].

Multi-Dimensional Haar Wavelets. There are two distinct ways to generalize the Haar DWT to the multi-dimensional case, the standard and non-standard Haar decomposition [22]. Each method results from a natural generalization of the one-dimensional decomposition process described above, and both have been used in a wide variety of applications. Consider the case where a is a d-dimensional data array, comprising $N^d$ entries. As in the one-dimensional case, the Haar DWT of a results in a d-dimensional wavelet-coefficient array $w_a$ with $N^d$ coefficient entries. The non-standard Haar DWT works in $\log N$ phases where, in each phase, one step of pairwise averaging and differencing is performed across each of the d dimensions; the process is then repeated recursively (for the next phase) on the quadrant containing the averages across all dimensions. The standard Haar DWT works in d phases where, in each phase, a complete 1-dimensional DWT is performed for each one-dimensional row of array cells along dimension k, for all k = 1, ..., d. (Full details and efficient decomposition algorithms are in [4, 24].) The supports of non-standard d-dimensional Haar coefficients are d-dimensional hyper-cubes (over dyadic ranges in $[N]^d$), since they combine 1-dimensional basis functions from the same resolution levels across all dimensions. The support of a standard d-dimensional coefficient (indexed by, say, $(i_1, \ldots, i_d)$) is, in general, a d-dimensional hyper-rectangle, given by the cross-product of the supports of the 1-dimensional basis functions corresponding to coefficient indexes $i_1, \ldots, i_d$.

Error-tree structures can again be used to conceptualize the properties of both forms of d-dimensional Haar DWTs. In the non-standard case, the error tree is essentially a quadtree (with a fanout of $2^d$), where all internal non-root nodes contain $2^d - 1$ coefficients that have the same support region in the original data array but with different quadrant signs (and magnitudes) for their contribution. For the standard d-dimensional Haar DWT, the error-tree structure is essentially a “cross-product” of d one-dimensional error trees with the support and signs of coefficient $(i_1, \ldots, i_d)$ determined by the product of the component one-dimensional basis vectors (for $i_1, \ldots, i_d$). Fig. 1(b) depicts a simple example error-tree structure for the non-standard Haar DWT of a 2-dimensional 4 × 4 data array.
It follows that updating a single data entry in the d-dimensional data array a impacts the values of $(2^d - 1)\log N + 1 = O(2^d \log N)$ coefficients in the non-standard case, and $(\log N + 1)^d = O(\log^d N)$ coefficients in the standard case. Both multi-dimensional decompositions preserve orthonormality; thus, retaining the largest B coefficient values gives a provably SSE-optimal B-term approximation of a.
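The one-dimensional decomposition of Section 2.2 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function names `haar_dwt`, `normalize`, and `coefficient_updates` are hypothetical, the transform is computed offline on a stored array (which a streaming algorithm cannot afford), and coefficients are kept in the unnormalized form used by the worked example.

```python
import math

def haar_dwt(a):
    """Unnormalized Haar DWT by repeated pairwise averaging and differencing.

    Returns [overall average, details in order of increasing resolution].
    """
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"
    avgs = [float(x) for x in a]
    details = []                      # coarsest level ends up first
    while len(avgs) > 1:
        pairs = list(zip(avgs[0::2], avgs[1::2]))
        level = [(x - y) / 2 for x, y in pairs]   # average minus second value
        avgs = [(x + y) / 2 for x, y in pairs]
        details = level + details
    return avgs + details

def normalize(w):
    """Scale the coefficient at level l by sqrt(N / 2^l) (c_0 counts as level 0)."""
    n = len(w)
    out = [w[0] * math.sqrt(n)]
    for i in range(1, n):
        level = i.bit_length() - 1    # floor(log2(i))
        out.append(w[i] * math.sqrt(n / 2 ** level))
    return out

def coefficient_updates(i, v, n):
    """Translate a data update (i, v) into the log N + 1 coefficient changes."""
    log_n = n.bit_length() - 1
    updates = [(0, v / n)]            # the overall average
    for l in range(log_n):
        size = n >> l                 # support size of a level-l coefficient
        k = i // size                 # which level-l coefficient covers i
        sign = 1 if (i % size) < size // 2 else -1
        updates.append(((1 << l) + k, sign * v * (1 << l) / n))
    return updates
```

For the example vector a = [2, 2, 0, 2, 3, 5, 4, 4], `haar_dwt` reproduces the coefficients given in the text, and applying the four pairs returned by `coefficient_updates(5, 3, 8)` to them gives the same result as recomputing the transform after adding 3 to a[5].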

3 Problem Formulation and Overview of Approach

Our goal is to continuously track a compact B-coefficient wavelet synopsis under our general, high-speed update-stream model. We require our solution to satisfy all three key requirements for streaming algorithms outlined earlier in this paper, namely: (1) sublinear synopsis space, (2) sublinear per-item update time, and (3) sublinear query time, where sublinear means polylogarithmic in the domain size N. As in [11], our algorithms return only an approximate synopsis comprising (at most) B Haar coefficients that is provably near-optimal (in terms of the captured energy of the underlying vector) assuming that our vector satisfies the “small-B property” (i.e., most of its energy is concentrated in a small number of Haar DWT coefficients); this assumption is typically satisfied for most real-life data distributions [11].

The streaming algorithm presented by Gilbert et al. [11] (termed “GKMS” in the remainder of the paper) focuses primarily on the one-dimensional case. The key idea is to maintain an AMS sketch for the streaming data vector a (as discussed in Sec. 2.1). To produce the approximate B-term representation, GKMS employs the constructed sketch of a to estimate the inner product of a with all wavelet basis vectors, essentially performing an exhaustive search over the space of all wavelet coefficients to identify important ones. Although techniques based on range-summable random variables constructed using Reed-Muller codes were proposed to reduce or amortize the cost of this exhaustive search by allowing the sketches of basis vectors to be computed more quickly, the overall query time for discovering the top coefficients remains superlinear in N (i.e., at least $\Omega(\frac{1}{\epsilon^2} N \log N)$), violating our third requirement. For large data domains, say $N = 2^{32} \approx 4$ billion (such as the IP address domain considered in [11]), a query can take a very long time: over an hour, even if a million coefficient queries can be answered per second! This essentially renders a direct extension of the GKMS technique to multiple dimensions infeasible since it implies an exponential explosion in query cost (requiring at least $\Omega(N^d)$ time to cycle through all coefficients in d dimensions). In addition, the update cost of the GKMS algorithm is linear in the size of the sketch since the whole data structure must be “touched” for each update. This is problematic for high-speed data streams and/or even moderate-sized sketch synopses.

Our Approach. Our proposed solution relies on two key novel ideas to avoid the shortcomings of the GKMS technique. First, we work entirely in the wavelet domain: instead of sketching the original data entries, our algorithms sketch the wavelet-coefficient vector $w_a$ as updates arrive. This avoids any need for complex range-summable hash functions. Second, we employ hash-based grouping in conjunction with efficient binary-search-like techniques to enable very fast updates as well as identification of important coefficients in polylogarithmic time.

– Sketching in the Wavelet Domain.
Our first technical idea relies on the observation that we can efficiently produce sketch synopses of the stream directly in the wavelet domain. That is, we translate the impact of each streaming update on the relevant wavelet coefficients. By the linearity properties of the DWT and our earlier description, we know that an update to the data entries corresponds to only polylogarithmically many coefficients in the wavelet domain. Thus, on receiving an update to a, our algorithms directly convert it to O(polylog(N)) updates to the wavelet coefficients, and maintain an approximate representation of the wavelet coefficient vector $w_a$.

– Time-Efficient Updates and Large-Coefficient Searches. Sketching in the wavelet domain means that, at query time, we have an approximate representation of the wavelet-coefficient vector $w_a$ and need to be able to identify all those coefficients that are “large”, relative to the total energy of the data $\|w_a\|_2^2 = \|a\|_2^2$. While AMS sketches can give us these estimates (a point query is just a special case of an inner product), querying remains much too slow, taking at least $\Omega(\frac{1}{\epsilon^2} N)$ time to find which of the N coefficients are the B largest. Note that although a lot of earlier work has given efficient streaming algorithms for identifying high-frequency items [5, 6, 18], our requirements here are quite different. Our techniques must monitor items (i.e., DWT coefficients) whose values increase and decrease over time, and which may very well be negative (even if all the data entries in a are positive). Existing work on “heavy-hitter” tracking focuses solely on non-negative frequency counts [6], often assumed to be non-decreasing over time [5, 18]. More strongly, we must find items whose squared value is a large fraction of the total vector energy $\|w_a\|_2^2$: this is a stronger condition since such “$L_2^2$ heavy hitters” may not be heavy hitters under the conventional sum-of-counts definition.³

Fig. 2. Our Group-Count Sketch (GCS) data structure: x is hashed (t times) to a bucket and then to a subbucket within the bucket, where a counter is updated

At a high level, our algorithms rely on a divide-and-conquer or binary-search-like approach for finding the large coefficients. To implement this, we need the ability to efficiently estimate sums-of-squares for groups of coefficients, corresponding to dyadic subranges of the domain [N]. We then disregard low-energy regions and recurse only on high-energy groups. Note that this guarantees no false negatives, as a group that contains a high-energy coefficient will also have high energy as a whole. Furthermore, our algorithms also employ randomized, hash-based grouping of dyadic groups and coefficients to guarantee that each update only touches a small portion of our synopsis, thus guaranteeing very fast update times.
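The divide-and-conquer search described above can be illustrated with a short Python sketch. It makes two simplifying assumptions that are not part of the paper's method: group energies are computed exactly from a stored coefficient vector (standing in for sketch-based estimates), and each range is always split in two. The function name `find_heavy_coefficients` and the `threshold` parameter are hypothetical.

```python
def find_heavy_coefficients(w, threshold):
    """Return indices i with w[i]**2 >= threshold by recursing on dyadic ranges.

    Ranges whose total energy is below the threshold are pruned; this cannot
    discard a heavy coefficient, because a range has at least the energy of
    any single entry inside it.
    """
    n = len(w)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of 2"

    def energy(lo, hi):
        # Exact sum of squares; an assumption standing in for a sketch estimate.
        return sum(x * x for x in w[lo:hi])

    heavy, stack = [], [(0, n)]
    while stack:
        lo, hi = stack.pop()
        if energy(lo, hi) < threshold:
            continue                  # low-energy region: do not recurse
        if hi - lo == 1:
            heavy.append(lo)          # a single high-energy coefficient
        else:
            mid = (lo + hi) // 2
            stack += [(lo, mid), (mid, hi)]
    return sorted(heavy)
```

With exact energies the search never misses a heavy coefficient, and only ranges that contain some energy above the threshold are ever expanded, which is what keeps the number of energy queries small when few coefficients are large.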

4 Our Solution: The GCS Synopsis and Algorithms

We introduce a novel, hash-based probabilistic synopsis data structure, termed Group-Count Sketch (GCS), that can estimate the energy (squared $L_2$ norm) of fixed groups of elements from a vector w of size N under our streaming model. (To simplify the exposition we initially focus on the one-dimensional case, and present the generalization to multiple dimensions later in this section.) Our GCS synopsis requires small, sublinear space and takes sublinear time to process each stream update item; more importantly, we can use a GCS to obtain a high-probability estimate of the energy of a group within additive error $\epsilon\|w\|_2^2$ in sublinear time. We then demonstrate how to use GCSs as the basis of efficient streaming procedures for tracking large wavelet coefficients.

Our approach takes inspiration from the AMS sketching solution for vector $L_2$-norm estimation; still, we need a much stronger result, namely the ability to estimate $L_2$ norms for a (potentially large) number of groups of items forming a partition of the data domain [N]. A simple solution would be to keep an AMS sketch of each group separately; however, there can be many groups, linear in N, and we cannot afford to devote this much space to the problem. We must also process streaming updates as quickly as possible. Our solution is to maintain a structure that first partitions items of w into their groups, and then maps groups to buckets using a hash function. Within each bucket, we apply a second stage of hashing of items to sub-buckets, each containing an atomic AMS sketch counter, in order to estimate the $L_2$ norm of the bucket. In our analysis, we show that this approach allows us to provide accurate estimates of the energy of any group in w with tight $\pm\epsilon\|w\|_2^2$ error guarantees.

³ For example, consider a set of items with counts {4, 1, 1, 1, 1, 1, 1, 1, 1}. The item with count 4 represents 2/3 of the sum of the squared counts, but only 1/3 of the sum of counts.

The GCS Synopsis. Assume a total of k groups of elements of w that form a partition of [N]. For notational convenience, we use a function id that identifies the specific group that an element belongs to, id : [N] → [k]. (In our setting, groups correspond to fixed dyadic ranges over [N] so the id mapping is trivial.) Following common data-streaming practice, we first define a basic randomized estimator for the energy of a group, and prove that it returns a good estimate (i.e., within $\pm\epsilon\|w\|_2^2$ additive error) with constant probability > 1/2; then, by taking the median estimate over t independent repetitions, we are able to reduce the probability of a bad estimate to exponentially small in t. Our basic estimator first hashes groups into b buckets and then, within each bucket, it hashes into c sub-buckets. (The values of the t, b, and c parameters are determined in our analysis.) Furthermore, as in AMS sketching, each item has a {±1} random variable associated with it. Thus, our GCS synopsis requires three sets of t hash functions, $h_m : [k] \to [b]$, $f_m : [N] \to [c]$, and $\xi_m : [N] \to \{\pm 1\}$ (m = 1, ..., t). The randomization requirement is that the $h_m$’s and $f_m$’s are drawn from families of pairwise independent functions, while the $\xi_m$’s are four-wise independent (as in basic AMS); such hash functions are easy to implement, and require only O(log N) bits to store. Our GCS synopsis s consists of t · b · c counters (i.e., atomic AMS sketches), labeled s[1][1][1] through s[t][b][c], that are maintained and queried as follows:

UPDATE(i, u). Set $s[m][h_m(id(i))][f_m(i)] \mathrel{+}= u \cdot \xi_m(i)$, for each m = 1, ..., t.

ESTIMATE(GROUP). Return the estimate $\mathrm{median}_{m=1,\ldots,t} \sum_{j=1}^{c} \left(s[m][h_m(\mathrm{GROUP})][j]\right)^2$ for the energy of the group of items GROUP ∈ {1, ..., k} (denoted by $\|\mathrm{GROUP}\|_2^2$).
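The UPDATE and ESTIMATE procedures above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the authors' implementation: the class names `PolyHash` and `GroupCountSketch` are hypothetical, indices are 0-based, and the hash functions are random polynomials over a prime field reduced modulo the target range (a common construction that is only approximately uniform after the final reduction).

```python
import random
from statistics import median

P = (1 << 61) - 1   # Mersenne prime used as the hash field (an assumption)

class PolyHash:
    """Random polynomial of degree k-1 over GF(P), a standard k-wise family."""
    def __init__(self, k, rng):
        self.coeffs = [rng.randrange(1, P)] + [rng.randrange(P) for _ in range(k - 1)]

    def __call__(self, x):
        acc = 0
        for coeff in self.coeffs:     # Horner evaluation modulo P
            acc = (acc * x + coeff) % P
        return acc

class GroupCountSketch:
    def __init__(self, t, b, c, group_id, seed=0):
        rng = random.Random(seed)
        self.t, self.b, self.c = t, b, c
        self.group_id = group_id                           # the id() mapping
        self.h = [PolyHash(2, rng) for _ in range(t)]      # groups -> buckets
        self.f = [PolyHash(2, rng) for _ in range(t)]      # items -> sub-buckets
        self.xi = [PolyHash(4, rng) for _ in range(t)]     # items -> {+1, -1}
        self.s = [[[0.0] * c for _ in range(b)] for _ in range(t)]

    def update(self, i, u):
        """UPDATE(i, u): touch one counter in each of the t repetitions."""
        g = self.group_id(i)
        for m in range(self.t):
            sign = 1 if self.xi[m](i) & 1 else -1
            self.s[m][self.h[m](g) % self.b][self.f[m](i) % self.c] += u * sign

    def estimate(self, group):
        """ESTIMATE(group): median over repetitions of the bucket's energy."""
        return median(
            sum(x * x for x in self.s[m][self.h[m](group) % self.b])
            for m in range(self.t)
        )
```

Each update touches exactly t counters and each estimate reads t · c counters. With a single item in the sketch, the estimate for its group is exact, because each repetition holds exactly one non-zero counter; errors arise only from other groups colliding in the same bucket and from items sharing a sub-bucket.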
Thus, the update and query times for a GCS synopsis are simply $O(t)$ and $O(t \cdot c)$, respectively. The following theorem summarizes our key result for GCS synopses.

Theorem 2. Our Group-Count Sketch algorithms estimate the energy of item groups of the vector $w$ within additive error $\epsilon\|w\|_2^2$ with probability $\geq 1 - \delta$ using space of $O\left(\frac{1}{\epsilon^3} \log \frac{1}{\delta}\right)$ counters, per-item update time of $O\left(\log \frac{1}{\delta}\right)$, and query time of $O\left(\frac{1}{\epsilon^2} \log \frac{1}{\delta}\right)$.

Proof. Fix a particular group GROUP and a row $m$ in the GCS; we drop the row index $m$ in the context where it is understood. Let BUCKET be the set of elements that hash into the same bucket as GROUP does: $\mathrm{BUCKET} = \{i \mid i \in [1, n] \wedge h(id(i)) = h(\mathrm{GROUP})\}$. Among those, let COLL be the set of elements other than those of GROUP: $\mathrm{COLL} = \{i \mid i \in [1, n] \wedge id(i) \neq \mathrm{GROUP} \wedge h(id(i)) = h(\mathrm{GROUP})\}$. In the following, we abuse notation in that we refer to both a group and the set of items in the group with the same name. Also, we write $\|S\|_2^2$ to denote the sum of squares of the elements (i.e., the squared $L_2$ norm) in set $S$: $\|S\|_2^2 = \sum_{i \in S} w[i]^2$. Let $est$ be the estimator for the sum of squares of the items of GROUP. That is, $est = \sum_{j=1}^{c} est_j$ where $est_j = (s[m][h_m(\mathrm{GROUP})][j])^2$ is the square of the count in sub-bucket $\mathrm{SUB}_j$. The expectation of each $est_j$ is, by simple calculation, the sum of squares of the items in sub-bucket $j$, which is a fraction of the sum of squares of the bucket. Similarly, using linearity of expectation and the four-wise independence of the $\xi$ hash functions, the variance of $est$ is bounded in terms of the square of the expectation:


$$E[est] = E\left[\|\mathrm{BUCKET}\|_2^2\right], \qquad \mathrm{Var}[est] \leq \frac{2}{c}\, E\left[\|\mathrm{BUCKET}\|_2^4\right]$$

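To calculate $E[\|\mathrm{BUCKET}\|_2^2]$, observe that the bucket contains items of GROUP as well as items from other groups denoted by the set COLL which is determined by $h$. Because of the pairwise independence of $h$, this expectation is bounded by a fraction of the total energy. Therefore:

$$E\left[\|\mathrm{BUCKET}\|_2^2\right] = \|\mathrm{GROUP}\|_2^2 + E\left[\|\mathrm{COLL}\|_2^2\right] \leq \|\mathrm{GROUP}\|_2^2 + \frac{1}{b}\|w\|_2^2$$

and

$$E\left[\|\mathrm{BUCKET}\|_2^4\right] = \|\mathrm{GROUP}\|_2^4 + E\left[\|\mathrm{COLL}\|_2^4\right] + 2\|\mathrm{GROUP}\|_2^2\, E\left[\|\mathrm{COLL}\|_2^2\right] \leq \|w\|_2^4 + \frac{1}{b}\|w\|_2^4 + 2\|w\|_2^2 \cdot \frac{1}{b}\|w\|_2^2 \leq \left(1 + \frac{3}{b}\right)\|w\|_2^4 \leq 2\|w\|_2^4$$

since $\|\mathrm{GROUP}\|_2^2 \leq \|w\|_2^2$ and $b \geq 3$. The estimator's expectation and variance satisfy

$$E[est] \leq \|\mathrm{GROUP}\|_2^2 + \frac{1}{b}\|w\|_2^2, \qquad \mathrm{Var}[est] \leq \frac{4}{c}\|w\|_2^4.$$

Applying the Chebyshev inequality we obtain $\Pr\left[|est - E[est]| \geq \lambda\|w\|_2^2\right] \leq \frac{4}{c\lambda^2}$ and by setting $c = \frac{32}{\lambda^2}$ the bound becomes $\frac{1}{8}$, for some parameter $\lambda$. Using the above bounds on variance and expectation and the fact that $|x - y| \geq \big||x| - |y|\big|$ we have

$$|est - E[est]| \geq \left|est - \|\mathrm{GROUP}\|_2^2 - \frac{1}{b}\|w\|_2^2\right| \geq \left|est - \|\mathrm{GROUP}\|_2^2\right| - \frac{1}{b}\|w\|_2^2.$$

Consequently (note that $\Pr[|x| > y] \geq \Pr[x > y]$),

$$\Pr\left[\left|est - \|\mathrm{GROUP}\|_2^2\right| - \frac{1}{b}\|w\|_2^2 \geq \lambda\|w\|_2^2\right] \leq \Pr\left[|est - E[est]| \geq \lambda\|w\|_2^2\right] \leq \frac{1}{8}$$

or equivalently, $\Pr\left[\left|est - \|\mathrm{GROUP}\|_2^2\right| \geq \left(\lambda + \frac{1}{b}\right)\|w\|_2^2\right] \leq \frac{1}{8}$. Setting $b = \frac{1}{\lambda}$ we get $\Pr\left[\left|est - \|\mathrm{GROUP}\|_2^2\right| \geq 2\lambda\|w\|_2^2\right] \leq \frac{1}{8}$ and to obtain an estimator with $\epsilon\|w\|_2^2$ additive error we require $\lambda = \frac{\epsilon}{2}$ which translates to $b = O\left(\frac{1}{\epsilon}\right)$ and $c = O\left(\frac{1}{\epsilon^2}\right)$. By Chernoff bounds, the probability that the median of $t$ independent instances of the estimator deviates by more than $\epsilon\|w\|_2^2$ is less than $e^{-qt}$, for some constant $q$. Setting this to the probability of failure $\delta$, we require $t = O\left(\log \frac{1}{\delta}\right)$, which gives the claimed bounds.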
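To make the constants in the proof concrete, the following Python sketch turns a target $(\epsilon, \delta)$ into sketch dimensions using $\lambda = \epsilon/2$, $b = 1/\lambda$ (with $b \geq 3$), and $c = 32/\lambda^2$. The function names are hypothetical, and the Chernoff constant $q$ is not specified in the text, so the default `q=0.25` is purely a placeholder assumption.

```python
import math

def gcs_parameters(eps, delta, q=0.25):
    """Return (t, b, c) from the proof's settings; q is an assumed constant."""
    lam = eps / 2.0                            # lambda = eps / 2
    b = max(3, math.ceil(1.0 / lam))           # b = 1 / lambda, and the proof needs b >= 3
    c = math.ceil(32.0 / lam ** 2)             # c = 32 / lambda^2
    t = math.ceil(math.log(1.0 / delta) / q)   # smallest t with e^{-q t} <= delta
    return t, b, c

def chebyshev_bound(c, lam):
    """Per-row failure bound 4 / (c * lambda^2) from the Chebyshev step."""
    return 4.0 / (c * lam ** 2)

if __name__ == "__main__":
    t, b, c = gcs_parameters(eps=0.5, delta=0.01)
    print(t, b, c, "counters:", t * b * c)
    print("per-row failure bound:", chebyshev_bound(c, 0.25))
```

The total number of counters is $t \cdot b \cdot c$, which is where the $O\left(\frac{1}{\epsilon^3} \log \frac{1}{\delta}\right)$ space bound comes from. In practice much smaller constants may suffice; the experiments note that a small constant $t$ is usually enough.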

Hierarchical Search Structure for Large Coefficients. We apply our GCS synopsis and estimators to the problem of finding items with large energy (i.e., squared value) in the $w$ vector. Since our GCS works in the wavelet domain (i.e., sketches the wavelet coefficient vector), this is exactly the problem of recovering important coefficients. To efficiently recover large-energy items, we impose a regular tree structure on top of the data domain $[N]$, such that every node has the same degree $r$. Each level in the tree induces a partition of the nodes into groups corresponding to $r$-adic ranges, defined by the nodes at that level.$^4$ For instance, a binary tree creates groups corresponding to dyadic ranges of size 1, 2, 4, 8, and so on. The basic idea is to perform a search over the tree for those high-energy items above a specified energy threshold, $\phi\|w\|_2^2$. Following the discussion in Section 3, we can prune groups with energy below the threshold and, thus, avoid looking inside those groups: if the estimated energy is accurate, then these cannot contain any high-energy elements. Our key result is that, using such a hierarchical search structure of GCSs, we can provably (within appropriate probability bounds) retrieve all items above the threshold plus a controllable error quantity ($(\phi + \epsilon)\|w\|_2^2$), and retrieve no elements below the threshold minus that small error quantity ($(\phi - \epsilon)\|w\|_2^2$).

$^4$ Thus, the $id$ function for level $l$ is easily defined as $id_l(i) = \lfloor i / r^l \rfloor$.


Theorem 3. Given a vector $w$ of size $N$ we can report, with high probability $\geq 1 - \delta$, all elements with energy above $(\phi + \epsilon)\|w\|_2^2$ (where $\phi \geq \epsilon$) within additive error of $\epsilon\|w\|_2^2$ (and therefore, report no item with energy below $(\phi - \epsilon)\|w\|_2^2$) using space of $O\left(\frac{\log_r N}{\epsilon^3} \cdot \log \frac{r \log_r N}{\phi\delta}\right)$, per item processing time of $O\left(\log_r N \cdot \log \frac{r \log_r N}{\phi\delta}\right)$ and query time of $O\left(\frac{r}{\phi\epsilon^2} \cdot \log_r N \cdot \log \frac{r \log_r N}{\phi\delta}\right)$.

Proof. Construct $\log_r N$ GCSs (with parameters to be determined), one for each level of our $r$-ary search-tree structure. We refer to an element that has energy above $\phi\|w\|_2^2$ as a "hot element", and similarly groups that have energy above $\phi\|w\|_2^2$ as "hot ranges". The key observation is that all $r$-adic ranges that contain a hot element are also hot. Therefore, at each level (starting with the root level), we identify hot $r$-adic ranges by examining only those $r$-adic ranges that are contained in hot ranges of the previous level. Since there can be at most $\frac{1}{\phi}$ hot elements, we only have to examine at most $\frac{1}{\phi} \log_r N$ ranges and pose that many queries. Thus, we require the failure probability to be $\frac{\phi\delta}{\log_r N}$ for each query so that, by the union bound, we obtain a failure probability of at most $\delta$ for reporting all hot elements. Further, we require each level to be accurate within $\epsilon\|w\|_2^2$ so that we obtain all hot elements above $(\phi + \epsilon)\|w\|_2^2$ and none below $(\phi - \epsilon)\|w\|_2^2$. The theorem follows.
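The top-down search used in the proof can be illustrated with a short Python sketch. To keep it self-contained, the per-level GCS estimates are replaced by a stand-in oracle that returns exact group energies; the names `find_hot_items` and `exact_energy_oracle` are hypothetical, and the domain size is assumed to be an exact power of $r$.

```python
def find_hot_items(r, levels, threshold, estimate):
    """Descend an r-ary tree over a domain of size r**levels.

    estimate(level, group) should return the (approximate) energy of the
    r-adic range with index `group` at `level`, i.e. items i with
    i // r**level == group. Level 0 groups are single items.
    """
    hot = [0]  # the root range covers the whole domain
    for level in range(levels - 1, -1, -1):
        # Only children of ranges that were hot one level up are examined.
        children = [g * r + j for g in hot for j in range(r)]
        hot = [g for g in children if estimate(level, g) >= threshold]
    return hot

def exact_energy_oracle(w, r):
    """Stand-in for one GCS per level: exact sum of squares of each range."""
    def estimate(level, group):
        size = r ** level
        return sum(x * x for x in w[group * size:(group + 1) * size])
    return estimate

if __name__ == "__main__":
    w = [0.0, 5.0, 0.0, 0.0, 1.0, 0.0, 0.0, -4.0]
    total = sum(x * x for x in w)
    print(find_hot_items(2, 3, 0.3 * total, exact_energy_oracle(w, 2)))
```

With sketch-based estimates in place of the exact oracle, each call to `estimate` may be off by up to $\epsilon\|w\|_2^2$, which is why the guarantee is stated for items above $(\phi + \epsilon)\|w\|_2^2$ and below $(\phi - \epsilon)\|w\|_2^2$.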

Setting the value of $r$ gives a tradeoff between query time and update time. Asymptotically, we see that the update time decreases as the degree of the tree structure, $r$, increases. This becomes more pronounced in practice, since it usually suffices to set $t$, the number of tests, to a small constant. Under this simplification, the update cost essentially reduces to $O(\log_r N)$, and the query time reduces to $O\left(\frac{r}{\epsilon^2\phi} \log_r N\right)$. (We will see this clearly in our experimental analysis.) The extreme settings of $r$ are 2 and $N$: $r = 2$ imposes a binary tree over the domain, and gives the fastest query time but $O(\log_2 N)$ time per update; $r = N$ means updates are effectively constant $O(1)$ time, but querying requires probing the whole domain, a total of $N$ tests to the sketch.

Sketching in the Wavelet Domain. As discussed earlier, given an input update stream for data entries in $a$, our algorithms build GCS synopses on the corresponding wavelet coefficient vector $w_a$, and then employ these GCSs to quickly recover a (provably good) approximate $B$-term wavelet representation of $a$. To accomplish the first step, we need an efficient way of "translating" updates in the original data domain to the domain of wavelet coefficients (for both one- and multi-dimensional data streams).
– One-Dimensional Updates. An update $(i, v)$ on $a$ translates to the following collection of $\log N + 1$ updates to wavelet coefficients (that lie on the path to leaf $a[i]$, Fig. 1(a)):

$$\left\{ \left\langle 0,\; 2^{-\frac{1}{2}\log N}\, v \right\rangle,\; \left\langle 2^{\log N - l} + k,\; (-1)^{k \bmod 2}\, 2^{-\frac{l}{2}}\, v \right\rangle : \text{for each } l = 0, \ldots, \log N - 1 \right\},$$
where $l = 0, \ldots, \log N - 1$ indexes the resolution level, and $k = \lfloor i 2^{-l} \rfloor$. Note that each coefficient update in the above set is easily computed in constant time.
– Multi-Dimensional Updates. We can use exactly the same reasoning as above to produce a collection of (constant-time) wavelet-coefficient updates for a given data update in $d$ dimensions (see Fig. 1(b)). As explained in Section 2.2, the size of this collection of updates in the wavelet domain is $O(\log^d N)$ and $O(2^d \log N)$ for standard and


non-standard Haar wavelets, respectively. A subtle issue here is that our search-tree structure operates over a linear ordering of the $N^d$ coefficients, so we require a fast method for linearizing the multi-dimensional coefficient array; any simple linearization technique will work (e.g., row-major ordering or other space-filling curves).

Using GCSs for Approximate Wavelets. Recall that our goal is to (approximately) recover the $B$ most significant Haar DWT coefficients, without exhaustively searching through all coefficients. As shown in Theorem 3, creating GCSs for dyadic ranges over the (linearized) wavelet-coefficient domain allows us to efficiently identify high-energy coefficients. (For simplicity, we fix the degree of our search structure to $r = 2$ in what follows.) An important technicality here is to select the right threshold for coefficient energy in our search process, so that our final collection of recovered coefficients provably captures most of the energy in the optimal $B$-term representation. Our analysis in the following theorem shows how to set this threshold, and proves that, for data vectors satisfying the "small-B property", our GCS techniques can efficiently track near-optimal approximate wavelet representations. (We present the result for the standard form of the multi-dimensional Haar DWT; the one-dimensional case follows as the special case $d = 1$.)

Theorem 4. If a $d$-dimensional data stream over the $[N]^d$ domain has a $B$-term standard wavelet representation with energy at least $\eta\|a\|_2^2$, where $\|a\|_2^2$ is the entire energy, then our GCS algorithms can estimate an at-most-$B$-term standard wavelet representation with energy at least $(1 - \epsilon)\eta\|a\|_2^2$ using space of $O\left(\frac{B^3 d \log N}{\epsilon^3 \eta^3} \cdot \log \frac{B d \log N}{\epsilon\eta\delta}\right)$, per item processing time of $O\left(d \log^{d+1} N \cdot \log \frac{B d \log N}{\epsilon\eta\delta}\right)$, and query time of $O\left(\frac{B^3 d}{\epsilon^3 \eta^3} \cdot \log N \cdot \log \frac{B d \log N}{\epsilon\eta\delta}\right)$.

Proof. Use our GCS search algorithm and Theorem 3 to find all coefficients with energy at least $\frac{\epsilon\eta}{B}\|a\|_2^2 = \frac{\epsilon\eta}{B}\|w\|_2^2$. (Note that $\|a\|_2^2$ can be easily estimated to within small relative error from our GCSs.) Among those choose the highest $B$ coefficients; note that there could be fewer than $B$ found. For those coefficients selected, observe we incur two types of error. Suppose we choose a coefficient which is included in the best $B$-term representation; then we could be inaccurate by at most $\frac{\epsilon\eta}{B}\|a\|_2^2$. Now, suppose we choose coefficient $c_1$ which is not in the best $B$-term representation. There has to be a coefficient $c_2$ which is in the best $B$-term representation, but was rejected in favor of $c_1$. For this rejection to have taken place their energy must differ by at most $2\frac{\epsilon\eta}{B}\|a\|_2^2$ by our bounds on the accuracy of estimation for groups of size 1. Finally, note that for any coefficient not chosen (for the case when we pick fewer than $B$ coefficients) its true energy must be less than $2\frac{\epsilon\eta}{B}\|a\|_2^2$. It follows that the total energy we obtain is at most $2\epsilon\eta\|a\|_2^2$ less than that of the best $B$-term representation. Setting parameters $\phi$, $\epsilon'$, $N'$ of Theorem 3 to $\phi = \epsilon' = \frac{\epsilon\eta}{B}$ and $N' = N^d$ we obtain the stated space and query time bounds. For the per-item update time, recall that a single update in the original data domain requires $O(\log^d N)$ coefficient updates.

The corresponding result for the non-standard Haar DWT follows along the same lines. The only difference with Theorem 4 comes in the per-update processing time which, in the non-standard case, is $O\left(d 2^d \log N \cdot \log \frac{B d \log N}{\epsilon\eta\delta}\right)$.
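The translation of a one-dimensional data update into $\log N + 1$ coefficient updates, described under "Sketching in the Wavelet Domain", can be sketched as follows. The indexing used here is one common orthonormal Haar layout (index 0 for the overall average, index $2^j + k$ for the $k$-th detail coefficient at level $j$, coarsest first); this layout and the function names are assumptions of the sketch and may not match the paper's level numbering exactly. A direct transform is included only to check the update rule.

```python
import math

def haar_updates(i, v, log_n):
    """Return (coefficient index, delta) pairs for the data update a[i] += v."""
    n = 1 << log_n
    updates = [(0, v / math.sqrt(n))]        # overall average coefficient
    for j in range(log_n):                   # j = 0 is the coarsest detail level
        support = n >> j                     # data items covered by one coefficient
        k = i // support                     # coefficient at this level that covers i
        in_left_half = (i % support) < support // 2
        sign = 1.0 if in_left_half else -1.0
        updates.append(((1 << j) + k, sign * v / math.sqrt(support)))
    return updates

def haar_transform(a):
    """Direct orthonormal Haar transform in the same layout (for checking only)."""
    n = len(a)
    log_n = n.bit_length() - 1
    w = [0.0] * n
    w[0] = sum(a) / math.sqrt(n)
    for j in range(log_n):
        support = n >> j
        half = support // 2
        for k in range(1 << j):
            lo = k * support
            left = sum(a[lo:lo + half])
            right = sum(a[lo + half:lo + support])
            w[(1 << j) + k] = (left - right) / math.sqrt(support)
    return w

if __name__ == "__main__":
    a = [3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0]
    w = [0.0] * len(a)
    for i, v in enumerate(a):                # stream the data one update at a time
        for idx, delta in haar_updates(i, v, 3):
            w[idx] += delta
    print(w)
```

Each pair produced by `haar_updates` would then be fed to the UPDATE operation of the GCS at every level of the search tree. Because the transform is orthonormal, the energy of `w` equals the energy of `a`, which is the property the sketches rely on.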

Fig. 3. Performance on one-dimensional data. Panels: (a) Per-Item Update Time against domain size; (b) Query Time against domain size; (c) Per-Item Update Time against space; (d) Query Time against space. Axes: per-item update time (µsecs) or query time (msecs), plotted against log of domain size in (a) and (b) and against sketch size (360KB, 1.2MB, 2.9MB) in (c) and (d). Methods plotted: GKMS (panels (a) and (b) only), fast-GKMS, GCS-1, GCS-2, GCS-4, GCS-8, GCS-logn.

5 Experiments

Data Sets and Methodology. We implemented our algorithms in a mixture of C and C++, for the Group-Count sketch (GCS) with variable degree. For comparison we also implemented the method of [11] (GKMS) as well as a modified version of the algorithm with faster update performance using ideas similar to those in the Group-Count sketch, which we denote by fast-GKMS. Experiments were performed on a 2GHz processor machine, with 1GB of memory. We worked with a mixture of real and synthetic data:
– Synthetic Zipfian Data was used to generate data from arbitrary domain sizes and with varying skewness. By default the skewness parameter of the distribution is $z = 1.1$.
– Meteorological data set$^5$ comprised of $10^5$ meteorological measurements. These were quantized and projected appropriately to generate data sets with dimensionalities between 1 and 4. For the experiments described here, we primarily made use of the AirTemperature and WindSpeed attributes to obtain 1- and 2-dimensional data streams.

In our experiments, we varied the domain size, the size of the sketch$^6$ and the degree of the search tree of our GCS method and measured (1) per-item update time, (2) query time and (3) accuracy. In all figures, GCS-k denotes that the degree of the search tree is $2^k$; i.e., GCS-1 uses a binary search tree, whereas GCS-logn uses an n-degree tree, and so has a single level consisting of the entire wavelet domain.

$^5$ http://www-k12.atmos.washington.edu/k12/grayskies/
$^6$ In each experiment, all methods are given the same total space to use.
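As a rough illustration of what the GCS-k naming implies, the Python sketch below computes the degree and the number of tree levels for a given $k$ and domain size. The function name `tree_shape` is hypothetical, and rounding the level count up when $k$ does not divide $\log N$ is an assumption; the text does not say how that case is handled.

```python
import math

def tree_shape(log_n, k):
    """Degree and level count of the search tree for GCS-k over 2**log_n items."""
    k = min(k, log_n)                  # GCS-logn: a single level over the whole domain
    levels = math.ceil(log_n / k)      # assumed rounding when k does not divide log_n
    return {"degree": 2 ** k, "levels": levels}

if __name__ == "__main__":
    for k in (1, 2, 4, 8, 20):
        shape = tree_shape(20, k)
        # One sketch is updated per level; a query expands `degree` children per hot range.
        print(f"GCS-{k}: degree={shape['degree']}, levels={shape['levels']}")
```

Fewer levels mean fewer sketches touched per update, while a larger degree means more children probed for each hot range at query time, which is the tradeoff explored in the experiments.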

Fig. 4. Accuracy of Wavelet Synopses. Panels: (a) z=1.1; (b) z=1.4. Both panels plot sse/energy against the number of wavelet coefficients (0 to 40). Methods plotted: GCS-1, GCS-logn, offline (fast-GKMS also appears in one panel).

One-Dimensional Experiments. In the first experimental setup we used a synthetic 1-dimensional data stream with updates following the Zipfian distribution ($z = 1.1$). Space was increased in proportion to the log of the domain size, so for $\log N = 14$, 280KB was used, up to 600KB for $\log N = 30$. Figure 3 (a) shows the per-item update time for various domain sizes, and Figure 3 (b) shows the time required to perform a query, asking for the top-5 coefficients. The GKMS method takes orders of magnitude longer for both updates and queries, and this behavior is seen in all other experiments, so we do not consider it further. Apart from this, the ordering (fastest to slowest) is reversed between update time and query time. Varying the degree of the search tree allows update time and query time to be traded off. While the fast-GKMS approach is the fastest for updates, it is dramatically more expensive for queries, by several orders of magnitude. For domains of size $2^{22}$, it takes several hours to recover the coefficients, and extrapolating to a 32-bit domain means recovery would take over a week. Clearly this is not practical for realistic monitoring scenarios. Although GCS-logn also performs exhaustive search over the domain size, its query times are significantly lower as it does not require a sketch construction and inner-product query per wavelet coefficient.

Figures 3 (c) and (d) show the performance as the sketch size is increased. The domain size was fixed to $2^{18}$ so that the fast-GKMS method would complete a query in reasonable time. Update times do not vary significantly with increasing space, in line with our analysis (some increase in cost may be seen due to cache effects).

We also tested the accuracy of the approximate wavelet synopsis for each method. We measured the SSE-to-energy ratio of the estimated $B$-term synopses for varying $B$ and varying Zipf parameter and compared it against the optimal $B$-term synopsis computed offline. The results are shown in Figures 4 (a) and (b), where each sketch was given space 360KB. In accordance with the analysis (GCS requires $O\left(\frac{1}{\epsilon}\right)$ times more space to provide the same guarantees as GKMS) the GCS method is slightly less accurate when estimating more than the top-15 coefficients. However, experiments showed that increasing the size to 1.2MB resulted in equal accuracy. Finally we tested the performance of our methods

Fig. 5. Performance on 1-d Real Data and multi-d Real and Synthetic Data. Panels: (a) Real data in 1-d; (b) Synthetic data in multi-d; (c) Real data in 2-d (standard DWT); (d) Real data in 2-d (Non-standard DWT). Panels (a), (c), and (d) plot update and query time (msecs) for each method: fast-GKMS, GCS-1, GCS-2, GCS-4, GCS-6, GCS-8, GCS-logn. Panel (b) plots time (msecs) against the number of dimensions (1 to 4) for the series S-update, NS-update, S-query, and NS-query.

on single-dimensional meteorological data of domain size $2^{20}$. Per-item and query times in Figure 5 (a) are similar to those on synthetic data.

Multi-Dimensional Experiments. We compared the methods for both wavelet decomposition types in multiple dimensions. First we tested our GCS method for a synthetic dataset ($z = 1.1$, $10^5$ tuples) of varying dimensionality. In Figure 5 (b) we kept the total domain size constant at $2^{24}$ while varying the dimensions between 1 and 4. The per-item update time is higher for the standard decomposition, as there are more updates on the wavelet domain per update on the original domain. The increase in query time can be attributed to the increasing sparseness of the domain as the dimensionality increases, which makes searching for big coefficients harder. This is a well-known effect of multidimensional standard and non-standard decompositions. For the real dataset, we focus on the two-dimensional case; higher dimensions are similar. Figure 5 (c) and (d) show results for the standard and non-standard decompositions, respectively. The difference between GCS methods and fast-GKMS is more pronounced, because of the additional work in producing multidimensional wavelet coefficients, but the query times remain significantly less (query times were in the order of hours for fast-GKMS), and the difference becomes many times greater as the size of the data domain increases.

Experimental Summary. The Group-Count sketch approach is the only method that achieves reasonable query times to return an approximate wavelet representation of


data drawn from a moderately large domain ($2^{20}$ or larger). Our first implementation is capable of processing tens to hundreds of thousands of updates per second, and giving the answer to queries in the order of a few seconds. Varying the degree of the search tree allows a tradeoff between query time and update time to be established. The observed accuracy is almost indistinguishable from the exact solution, and the methods extend smoothly to multiple dimensions with little degradation of performance.
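The SSE-to-energy measure used in the accuracy experiments can be illustrated with a small Python sketch. It assumes the error is computed in the (orthonormal) coefficient domain, where it equals the data-domain sum of squared errors; the text does not spell out the exact computation, so this convention and the function names are assumptions.

```python
def sse_to_energy(w, synopsis):
    """Ratio of squared error to total energy for a sparse synopsis.

    w: list of true wavelet coefficients.
    synopsis: dict mapping retained coefficient index -> estimated value;
    coefficients absent from the synopsis are treated as zero.
    """
    energy = sum(x * x for x in w)
    sse = sum((x - synopsis.get(i, 0.0)) ** 2 for i, x in enumerate(w))
    return sse / energy

def optimal_synopsis(w, B):
    """Offline optimum under this measure: keep the B largest-energy coefficients exactly."""
    top = sorted(range(len(w)), key=lambda i: w[i] * w[i], reverse=True)[:B]
    return {i: w[i] for i in top}

if __name__ == "__main__":
    w = [4.0, -3.0, 1.0, 0.0]
    print(sse_to_energy(w, optimal_synopsis(w, 1)))   # offline optimum for B = 1
    print(sse_to_energy(w, {0: 3.5}))                 # a synopsis with an estimated value
```

A sketch-based method is then compared against `optimal_synopsis` for the same $B$: the closer the two ratios, the less accuracy is lost to approximation.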

6 Related Work

Wavelets have a long history of successes in the signal and image processing arena [16, 22] and, recently, they have also found their way into data-management applications. Matias et al. [19] first proposed the use of Haar-wavelet coefficients as synopses for accurately estimating the selectivities of range queries. Vitter and Wang [24] describe I/O-efficient algorithms for building multi-dimensional Haar wavelets from large relational data sets and show that a small set of wavelet coefficients can efficiently provide accurate approximate answers to range aggregates over OLAP cubes. Chakrabarti et al. [4] demonstrate the effectiveness of Haar wavelets as a general-purpose approximate query processing tool by designing efficient algorithms that can process complex relational queries (with joins, selections, etc.) entirely in the wavelet-coefficient domain. Schmidt and Shahabi [21] present techniques using the Daubechies family of wavelets to answer general polynomial range-aggregate queries. Deligiannakis and Roussopoulos [8] introduce algorithms for building wavelet synopses over data with multiple measures. Finally, I/O efficiency issues are studied by Jahangiri et al. [15] for both forms of the multi-dimensional DWT.

Interest in data streams has also increased rapidly in recent years, as more algorithms are presented that provide solutions in a streaming one-pass, low-memory environment. Overviews of data-streaming issues and algorithms can be found, for instance, in [3, 20]. Sketches first appeared for estimating the second frequency moment of a set of elements [2] and have since proven to be a useful summary structure in such a dynamic setting. Their applications include estimating join sizes of queries over streams [1, 9], maintaining wavelet synopses [11], constructing histograms [12, 23], and estimating frequent items [5, 6] and quantiles [13].

The work of Gilbert et al. [11] for estimating the most significant wavelet coefficients is closely related to ours. As we discuss, the limitation is the high query time required for returning the approximate representation. In follow-up work, the authors proposed a more theoretical approach with somewhat improved worst-case query times [12]. This work considers an approach based on a complex construction of range-summable random variables to build sketches from which wavelet coefficients can be obtained. The update times remain large. Our bounds improve those that follow from [12], and our algorithm is much simpler to implement. In similar spirit, Thaper et al. [23] use AMS sketches to construct an optimal B-bucket histogram of large multi-dimensional data. No efficient search techniques are used apart from an exhaustive greedy heuristic which always chooses the next best bucket to include in the histogram; still, this requires an exhaustive search over a huge space. The idea of using group-testing techniques to more efficiently find heavy items appears in several prior works [6, 7, 12]; here, we show that it is possible to apply similar


ideas to groups under the $L_2$ norm, which has not been explored previously. Recently, different techniques have been proposed for constructing wavelet synopses that minimize non-Euclidean error metrics, under the time-series model of streams [14, 17].

7 Conclusions

We have proposed the first known streaming algorithms for space- and time-efficient tracking of approximate wavelet summaries for both one- and multi-dimensional data streams. Our approach relies on a novel Group-Count Sketch (GCS) synopsis that, unlike earlier work, satisfies all three key requirements of effective streaming algorithms, namely: (1) polylogarithmic space usage; (2) small, logarithmic update times (essentially touching only a small fraction of the GCS for each streaming update); and (3) polylogarithmic query times for computing the top wavelet coefficients from the GCS. Our experimental results with both synthetic and real-life data have verified the effectiveness of our approach, demonstrating the ability of GCSs to support very high speed data sources. As part of our future work, we plan to extend our approach to the problem of extended wavelets [8] and histograms [23].

References

1. N. Alon, P. B. Gibbons, Y. Matias, and M. Szegedy. “Tracking join and self-join sizes in limited storage”. In ACM PODS, 1999.
2. N. Alon, Y. Matias, and M. Szegedy. “The space complexity of approximating the frequency moments”. In ACM STOC, 1996.
3. B. Babcock, S. Babu, M. Datar, R. Motwani, and J. Widom. “Models and issues in data stream systems”. In ACM PODS, 2002.
4. K. Chakrabarti, M. N. Garofalakis, R. Rastogi, and K. Shim. “Approximate query processing using wavelets”. In VLDB, 2000.
5. M. Charikar, K. Chen, and M. Farach-Colton. “Finding frequent items in data streams”. In ICALP, 2002.
6. G. Cormode and S. Muthukrishnan. “What’s hot and what’s not: Tracking most frequent items dynamically”. In ACM PODS, 2003.
7. G. Cormode and S. Muthukrishnan. “What’s new: Finding significant differences in network data streams”. In IEEE Infocom, 2004.
8. A. Deligiannakis and N. Roussopoulos. “Extended wavelets for multiple measures”. In ACM SIGMOD, 2003.
9. A. Dobra, M. N. Garofalakis, J. Gehrke, and R. Rastogi. “Processing complex aggregate queries over data streams”. In ACM SIGMOD, 2002.
10. M. Garofalakis and A. Kumar. “Deterministic Wavelet Thresholding for Maximum-Error Metrics”. In ACM PODS, 2004.
11. A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. “One-pass wavelet decomposition of data streams”. IEEE TKDE, 15(3), 2003.
12. A. Gilbert, S. Guha, P. Indyk, Y. Kotidis, S. Muthukrishnan, and M. Strauss. “Fast, small-space algorithms for approximate histogram maintenance”. In ACM STOC, 2002.
13. A. Gilbert, Y. Kotidis, S. Muthukrishnan, and M. Strauss. “How to summarize the universe: Dynamic maintenance of quantiles”. In VLDB, 2002.
14. S. Guha and B. Harb. “Wavelet Synopsis for Data Streams: Minimizing non-Euclidean Error”. In KDD, 2005.


15. M. Jahangiri, D. Sacharidis, and C. Shahabi. “Shift-Split: I/O efficient maintenance of wavelet-transformed multidimensional data”. In ACM SIGMOD, 2005.
16. B. Jawerth and W. Sweldens. “An Overview of Wavelet Based Multiresolution Analyses”. SIAM Review, 36(3), 1994.
17. P. Karras and N. Mamoulis. “One-pass wavelet synopses for maximum-error metrics”. In VLDB, 2005.
18. G. S. Manku and R. Motwani. “Approximate frequency counts over data streams”. In VLDB, 2002.
19. Y. Matias, J. S. Vitter, and M. Wang. “Wavelet-based histograms for selectivity estimation”. In ACM SIGMOD, 1998.
20. S. Muthukrishnan. “Data streams: algorithms and applications”. In SODA, 2003.
21. R. R. Schmidt and C. Shahabi. “Propolyne: A fast wavelet-based technique for progressive evaluation of polynomial range-sum queries”. In EDBT, 2002.
22. E. J. Stollnitz, T. D. DeRose, and D. H. Salesin. “Wavelets for computer graphics: theory and applications”. Morgan Kaufmann Publishers, 1996.
23. N. Thaper, S. Guha, P. Indyk, and N. Koudas. “Dynamic multidimensional histograms”. In ACM SIGMOD, 2002.
24. J. S. Vitter and M. Wang. “Approximate computation of multidimensional aggregates of sparse data using wavelets”. In ACM SIGMOD, 1999.