Stochastic Data Streams

S. Muthukrishnan

Talk Overview: Triptych

Classical Data Stream Algorithms: what is well understood.

Probabilistic Data Stream Algorithms: what may be reducible to the above.

Stochastic Data Stream Algorithms: what needs to be explored more.

The Basic Problem in Data Streams

[Figure: distribution F over items i = 1, …, n.]

Updates: F[i]++, F[i]--.

Problem: F[i] = ?

Use O(log n) space.

Data Streams: Motivation 

Update/query times should be sublinear, like polylog(n), because data arrives very fast.



Storage space and communication should be sublinear because ultra-fast memory is expensive and carries a power overhead.



Applications: 

IP network monitoring.



Sensors data analysis.

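Method: Count-Min Sketch [CM06]

Update $(i, c)$: $\mathrm{COUNT}[j, h_j(i)] \leftarrow \mathrm{COUNT}[j, h_j(i)] + c$ for every row $j$.

[Figure: the COUNT array has $\log(1/\delta)$ rows and $e/\varepsilon$ columns; an update $(i, c)$ adds $c$ to column $h_j(i)$ of each row $j$, using hash functions $h_1, \ldots, h_{\log(1/\delta)}$.]

Estimate: $\tilde{F}[i] = \min_j \mathrm{COUNT}[j, h_j(i)]$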
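The update and estimate rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the class name, the integer item keys, the modular hash family `((a*i + b) mod p) mod width`, and the rounding of the array dimensions are all assumptions made for the example.

```python
import math
import random


class CountMinSketch:
    """Illustrative Count-Min sketch over integer items (names are hypothetical)."""

    def __init__(self, eps, delta, seed=0):
        # Array dimensions as on the slide: e/eps columns, log(1/delta) rows.
        self.width = math.ceil(math.e / eps)
        self.depth = math.ceil(math.log(1 / delta))
        # Assumption: one hash of the form ((a*i + b) mod p) mod width per row.
        self.prime = (1 << 61) - 1
        rng = random.Random(seed)
        self.hashes = [
            (rng.randrange(1, self.prime), rng.randrange(self.prime))
            for _ in range(self.depth)
        ]
        self.count = [[0] * self.width for _ in range(self.depth)]

    def _bucket(self, j, i):
        a, b = self.hashes[j]
        return ((a * i + b) % self.prime) % self.width

    def update(self, i, c=1):
        # Add c (possibly negative, for deletions) to one cell in every row.
        for j in range(self.depth):
            self.count[j][self._bucket(j, i)] += c

    def estimate(self, i):
        # The minimum over rows is the least-contaminated counter for item i.
        return min(self.count[j][self._bucket(j, i)] for j in range(self.depth))


sketch = CountMinSketch(eps=0.01, delta=0.01)
for item in range(1000):
    sketch.update(item, +1)      # insertions
for item in range(4, 1000):
    sketch.update(item, -1)      # deletions; items 0..3 remain
print([sketch.estimate(item) for item in range(6)])
```

As long as every F[i] stays nonnegative, each counter only ever overestimates, so `estimate(i)` is never below the true F[i].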

Count-Min Sketch

Claim: $F[i] \le \tilde{F}[i]$. With probability at least $1 - \delta$,

$\tilde{F}[i] \le F[i] + \varepsilon \sum_i F[i]$

Space used is $O((1/\varepsilon) \log(1/\delta))$. Time per update is $O(\log(1/\delta))$.

In contrast, need $\Omega(1/\varepsilon^2)$ space for norm embedding.

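Count-Min Sketch Proof

Claim: With probability at least $1 - \delta$,

$\tilde{F}[i] \le F[i] + \varepsilon \sum_i F[i]$

Write $\mathrm{COUNT}[j, h_j(i)] = F[i] + X_{i,j}$, where $X_{i,j}$ is the contribution of the other items that collide with $i$ in row $j$. Then

$E(X_{i,j}) = \frac{\varepsilon}{e} \sum_i F[i]$

and, applying Markov's inequality to each of the $\log(1/\delta)$ independent rows,

$\Pr(\tilde{F}[i] > F[i] + \varepsilon \sum_i F[i]) = \Pr(\forall j:\ F[i] + X_{i,j} > F[i] + \varepsilon \sum_i F[i]) = \Pr(\forall j:\ X_{i,j} \ge e\, E(X_{i,j})) < e^{-\log(1/\delta)}$

Pairwise independent h's suffice.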

The Challenge

1,000,000 items inserted, 999,996 items deleted, 4 items left.

Summary maintained.

Recovering items to ±0.1 ∑iF[i] accuracy => retrieve each item precisely.

Improving CM Sketch? 

Index Problem 





A has an n-bit string and sends messages to B, who wishes to compute the ith bit. This needs Ω(n) bits of communication.

Reduction of estimating F[i] in data stream model. 

I [1…1/2ε]



I[i] = 1 -> F[i]=2;



I[i]=0 -> F[i]=0; F[0]<-F[0]+2



Estimating F[i] to ε||F||=1 accuracy reveals I[i].
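The reduction above can be made concrete with a small Python sketch. The function names are hypothetical, and the decoding step assumes a one-sided estimate, F[i] ≤ estimate ≤ F[i] + ε‖F‖₁, which is the kind of guarantee the Count-Min sketch gives.

```python
def encode_index_instance(bits):
    """Turn bitstring I[1..k] into a frequency vector F[0..k] as in the reduction."""
    F = [0] * (len(bits) + 1)
    for i, bit in enumerate(bits, start=1):
        if bit:
            F[i] = 2          # I[i] = 1  ->  F[i] = 2
        else:
            F[0] += 2         # I[i] = 0  ->  F[i] = 0, padding goes to F[0]
    return F


def decode_bit(estimate):
    """Recover I[i] from an estimate with one-sided additive error at most 1."""
    return 1 if estimate > 1 else 0


bits = [1, 0, 0, 1, 1]               # k = 1/(2*eps) = 5, so eps = 1/10
F = encode_index_instance(bits)
print(F, sum(F))                     # every bit adds 2, so the L1 norm is 2k = 1/eps
```

Because every bit contributes exactly 2 to the total, ε‖F‖₁ = 1 regardless of the input, and an estimate that accurate separates F[i] = 0 from F[i] = 2.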

Summary of Data Stream Algorithms 

CM Sketch can be used for estimating 

Frequency moments, F2 = ∑i F[i]^2 with space O(1/ε^2).



Heavy hitters, F[i] ≥ φ ∑i F[i] with space O((1/ φ) log n) and update time O(log n).



Quantiles, ∑i


Inner product of two vectors, ∑i F[i] G[i]



Sparse representations like histograms, wavelets, compressed sensing of signals.



CM Sketch suffices for many tasks on vector data.



More work for clustering, graph, matrix streams.

References 

An improved data stream summary: The count-min sketch and its applications. Cormode and Muthukrishnan. JALG 04



Data Streams: Algorithms and Applications. Muthukrishnan. NOW Publishers. 2005.



Lecture Notes.





Spring School, Muthukrishnan and McGregor, Barbados 09.



Massive Data Algorithms, Indyk. MIT. 2007.

Open problems in data streams. McGregor, IITK Wkshp 07.

Probabilistic Data Streams

Probabilistic Stream Model 



Simplest model: 

A stream of pairs 〈ti, pi〉, ti ∈[1…n], prob pi, i ∈ [1,m]



With probability pi, ti is in the stream, else empty.

Example: S = (〈x, ½〉, 〈y, 1/3〉, 〈y, ¼〉) 

Encodes 6 “possible worlds” streams: P(S) = {φ, (x), (y), (x, y), (y, y), (x, y, y)}



Can compute probabilities of each possible stream:

G        φ      (x)    (y)    (x, y)   (y, y)   (x, y, y)
Pr[G]    1/4    1/4    5/24   5/24     1/24     1/24
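The possible-worlds probabilities for the example stream can be checked by brute-force enumeration. The sketch below is only an illustration: the function name is hypothetical, and enumerating all 2^m inclusion patterns is exactly the cost a streaming algorithm has to avoid.

```python
from collections import defaultdict
from fractions import Fraction
from itertools import product


def possible_worlds(stream):
    """Map each possible world (a tuple of items) to its exact probability."""
    worlds = defaultdict(Fraction)
    for present in product([False, True], repeat=len(stream)):
        prob = Fraction(1)
        world = []
        for (item, p), keep in zip(stream, present):
            prob *= p if keep else 1 - p
            if keep:
                world.append(item)
        # Different inclusion patterns can yield the same stream, e.g. (y).
        worlds[tuple(world)] += prob
    return dict(worlds)


S = [("x", Fraction(1, 2)), ("y", Fraction(1, 3)), ("y", Fraction(1, 4))]
for world, prob in sorted(possible_worlds(S).items(), key=lambda kv: (len(kv[0]), kv[0])):
    print(world, prob)
```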

Probabilistic Stream Computations 



Challenges: 

expensive to track all possible worlds



expensive to track all tuples in streams

Want to compute aggregate functions over prob. streams 

Given function F, find expected value: $E(F(S)) = \sum_{G \in P(S)} \Pr[G]\, F(G)$

Also compute variance to bound deviation: $\mathrm{Var}(F(S)) = E(F^2(S)) - E^2(F(S))$

Probabilistic Data Streams: Motivation 

Many sources of probabilistic inaccuracy: 

Sensor measurements, e.g., noisy RFID readings.

Data quality, e.g., quality of record linkages.

Labeling data with machine learning gives derived probabilistic streams, e.g., confidence in extracted rules.

Probabilistic Data Streams: Example. 

COUNT = E[ | {i ∈ [m]: ti is not empty} | ]



MEDIAN = x such that

max( E[| {i ∈ [m], ti < x} |], E[| {i ∈ [m], ti > x} |]) ≤ COUNT / 2

Probabilistic Medians: Algorithm 

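For each input 〈ti, pi〉, put $\lfloor 2m p_i/\varepsilon \rfloor$ copies of ti in S'.

Find l such that, counting copies in S',

$\max(|\{i \mid t_i < l\}|, |\{i \mid t_i > l\}|) \le (\frac{1}{2} + \frac{\varepsilon}{2})\,|S'|$

We have $\lfloor 2m p_i/\varepsilon \rfloor / (2m/\varepsilon) \ge p_i - \varepsilon/2m$. Hence, dividing by $2m/\varepsilon$,

$\max(\sum_{1 \le t_i < l} p_i, \sum_{l < t_i \le n} p_i) - \frac{\varepsilon}{2} \le (\frac{1}{2} + \frac{\varepsilon}{2})\,\mathrm{COUNT}$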
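The copy-based reduction can be sketched as follows. For illustration only, the expanded stream S' is held in memory and its exact median is taken by sorting; in the streaming setting S' would instead be fed to a classical approximate-quantile summary. The function name and the use of exact fractions are assumptions of the example.

```python
import math
from fractions import Fraction


def probabilistic_median(stream, eps):
    """Median of the expanded stream S' built from pairs (t_i, p_i)."""
    m = len(stream)
    expanded = []
    for t, p in stream:
        copies = math.floor(2 * m * p / eps)   # floor(2 m p_i / eps) copies of t_i
        expanded.extend([t] * copies)
    expanded.sort()
    # Exact median of S'; a streaming quantile algorithm would approximate this.
    return expanded[len(expanded) // 2]


stream = [
    (1, Fraction(1, 2)),
    (2, Fraction(3, 10)),
    (3, Fraction(9, 10)),
    (4, Fraction(1, 5)),
]
print(probabilistic_median(stream, Fraction(1, 10)))
```

In this example the probability mass below the returned value is 0.8 and the mass above it is 0.2, both under half of COUNT = 1.9.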

Probabilistic Data Streams: Summary 

DISTINCT: For each item in prob stream, produce many distinct copies in classical stream.



Frequency Moments, F2: randomly instantiate each item in a classical stream. Bound variance.



COUNT, F1: E(F1(S)) is the expected length of the stream

E(F1(S)) = ∑i pi (sum of Bernoulli variables)



Var(F1(S)) = ∑i pi(1-pi) (sum of variances)



SUM =∑i tipi is trivial



MEAN is sorta tricky. MEAN is not SUM/COUNT.



CLUSTER is interesting. k-center is nonlinear.
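A tiny example makes the COUNT, SUM and MEAN remarks concrete. The sketch below computes expectations by enumerating possible worlds, which is feasible only for toy inputs. The two-item stream is made up for illustration, and MEAN is taken here to be the expected value of sum/length over worlds, with the first item given probability 1 so that no world is empty (both are assumptions of the example).

```python
from fractions import Fraction
from itertools import product


def expected_value(stream, f):
    """E[f(world)] over all possible worlds of a list of (value, prob) pairs."""
    total = Fraction(0)
    for present in product([False, True], repeat=len(stream)):
        prob = Fraction(1)
        world = []
        for (t, p), keep in zip(stream, present):
            prob *= p if keep else 1 - p
            if keep:
                world.append(t)
        if prob == 0:
            continue          # skip impossible worlds (here: the empty stream)
        total += prob * f(world)
    return total


S = [(0, Fraction(1)), (10, Fraction(1, 2))]

count = sum(p for _, p in S)                    # E(F1) = sum of p_i
var_count = sum(p * (1 - p) for _, p in S)      # Var(F1) = sum of p_i (1 - p_i)
total = sum(t * p for t, p in S)                # SUM = sum of t_i p_i
mean = expected_value(S, lambda w: Fraction(sum(w), len(w)))

print(count, var_count, total, mean, total / count)
```

The closed forms for COUNT and SUM agree with the enumeration, while the expected mean differs from SUM/COUNT because the ratio is not linear in the item indicators.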

References



Sketching probabilistic data streams. Cormode and Garofalakis. SIGMOD 07.



Estimating statistical aggregates on probabilistic data streams. Jayram, McGregor, Muthukrishnan and Vee. PODS 07.



Exceeding expectations and clustering uncertain data. Guha and Munagala. PODS 09.

Stochastic Data Streams

Alerting the MAX on Stochastic Stream 

Distribution D given ahead of time. Input is a stochastic stream x1, x2, …, xn, each xi is drawn from D. n is known.



Stop at input t and output xt.



Goal: maximize xt. Formally, max E(xt). Even more formally,

$\max \frac{E(x_t)}{E(\mathrm{OPT})}$, where $E(\mathrm{OPT}) = E(\max_i x_i)$

Can look a priori at the distribution of max_i x_i.

Not the same as finding max_i x_i.

Alerting MAX: Result 

An algorithm that finds t such that E(xt)/ E(OPT) ≥ ½.



Ingredient: Prophet Inequality.



Algorithm: 

x* = max_i x_i.

m: median of x*, that is, Pr(x* < m) ≤ 1/2 and Pr(x* > m) ≤ 1/2.

τ: smallest t such that xt > m.

τ can be determined on the stream, and gives the result.

Detail: τ': smallest t such that xt ≥ m. Either τ or τ' gives the result. Simple rule to determine which.
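The threshold rule can be tried out in a short simulation. The sketch below assumes D is Uniform[0, 1], for which the median of the maximum has a closed form, and it assumes the rule accepts the last input if no input exceeds the threshold; both choices, and all function names, are illustrative assumptions rather than part of the talk.

```python
import random


def median_of_max_uniform(n):
    # For D = Uniform[0, 1], Pr(max <= m) = m**n, so the median of the max is 0.5**(1/n).
    return 0.5 ** (1.0 / n)


def threshold_stop(xs, m):
    """Return x_tau, where tau is the first t with x_t > m."""
    for x in xs:
        if x > m:
            return x
    return xs[-1]   # assumption: forced to take the last input if nothing exceeds m


def simulate(n=10, trials=20000, seed=1):
    """Estimate E(x_tau) / E(max_i x_i) by Monte Carlo."""
    rng = random.Random(seed)
    m = median_of_max_uniform(n)
    got = best = 0.0
    for _ in range(trials):
        xs = [rng.random() for _ in range(n)]
        got += threshold_stop(xs, m)
        best += max(xs)
    return got / best


ratio = simulate()
print(ratio)
```

The rule uses only the precomputed threshold m and a single pass, so it needs constant space; for this distribution the simulated ratio sits comfortably above the guaranteed 1/2.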

References 

Stochastic data streams, Muthukrishnan, MFCS 09.



A survey of prophet inequalities in optimal stopping theory. Hill, T.P. and Kertz, R.P. Contemporary Mathematics, AMS, Vol. 125, pp. 191-207, 1992.



On semimarts, amarts and processes with finite value. Krengel, U. and Sucheston, L.. Prob. on Banach Spaces, 1978, pp. 197-266.



Comparison of threshold stop rules and maximum for independent nonnegative random variables. Samuel-Cahn, E. Ann. Probab. 12, 1984, pp. 1213-1216.

Problem: Stochastic clustering (on streams) 

Given a distribution D in [0,1] and integer k.



Points arrive online p1,…,pt, each drawn from D.



Improve over streaming k-center algorithms on p1, …,pt in space, accuracy, whatever.



Simple Exercise. Find median (not k-center).

Summary 

Classical data stream model:

Well understood. Still technical results remain, e.g., lower bounds.

Probabilistic stream model:

In simple cases, can be reduced to classical streams. More complex problems are difficult. Logic and complexity.

Stochastic stream model: 

This talk/paper defines the model and makes a start. Lot remains to be done.
