Creating Probabilistic Databases from Imprecise Time-Series Data

Saket Sathe, Hoyoung Jeung, Karl Aberer

EPFL, Switzerland

13th April, 2011

S. Sathe, H. Jeung, K. Aberer (2011)

EPFL, Switzerland


Outline

Dynamic Density Metrics
Measure of Quality
Efficiently creating probabilistic views
Approximating Gaussian distributions using σ–cache
Parameter setting under provable guarantees
Experiments

Motivating example: the probability distribution p(R) showing Alice's position, with the 3σ area taken as a reasonable boundary. The probability that Alice is in a room is the integral of p(R) dR over the intersection of that room with the 3σ area (for example, room 4 ∩ 3σ area).

raw_values
time   x     y
1      1.1   2.3
2      1.3   2.1
:      :     :

prob_view
time   room   probability
1      1      0.5
1      2      0.1
1      3      0.3
1      4      0.1
2      1      0.2
2      2      0.4
2      3      0.1
2      4      0.3

[Figure: floor plan with rooms 1 to 4 in the x-y plane, showing the distribution p(R) centered at μ at time = 1 and the unknown position at time = 2.]

Problem Setting ˆ t2

values

S tH1

rt

pt(Rt )

rt-1 t-H-1

rt t-1

t

time

Dynamic Density Metric

alues

H , the dynamic density metric infers time-dependent probability Given St−1 rˆt) ) t= distributions pt (Rt ) at time t, H where Rtpis associated rt R t(Rat )random variable p t( (R t = S t 1 with rt . t p rˆt u

S. Sathe, H. Jeung, K. Aberer (2011)

EPFL, Switzerland

3 / 15

GARCH Metric

$p_t(R_t) \sim N(\hat{r}_t, \hat{\sigma}_t^2)$

$\hat{r}_t$ is modeled using an ARMA model.
$\hat{\sigma}_t^2$ is modeled using a GARCH model.
Thus $p_t(R_t)$ is $N(\hat{r}_t, \hat{\sigma}_t^2)$. We refer to this approach as ARMA-GARCH.

[Figure: values against time, showing the window $S^H_{t-1}$ and the Gaussian density $p_t(R_t)$ at time $t$, with $p_t(R_t = \hat{r}_t)$ and $p_t(R_t = r_t)$ marked.]
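As a rough illustration of how an ARMA mean and a GARCH variance combine into $N(\hat{r}_t, \hat{\sigma}_t^2)$, the sketch below uses the simplest member of each family, AR(1) and GARCH(1,1), with hand-picked coefficients. The function name, the coefficients (phi, omega, alpha, beta), and the starting value of the variance recursion are assumptions made for illustration; the slides do not state the model orders or how the parameters are estimated.

```python
from statistics import pvariance


def ar1_garch11_forecast(window, phi=0.9, omega=0.05, alpha=0.1, beta=0.8):
    """One-step-ahead mean and variance from a window of past values.

    The coefficients are hypothetical and fixed; a real system would
    estimate them from the window.
    """
    if len(window) < 3:
        raise ValueError("need at least three values")
    mean = sum(window) / len(window)
    centered = [r - mean for r in window]
    # AR(1) residuals on the mean-centered series: a_i = x_i - phi * x_{i-1}.
    residuals = [centered[i] - phi * centered[i - 1] for i in range(1, len(centered))]
    # GARCH(1,1) recursion: next_var = omega + alpha * a^2 + beta * var.
    var = pvariance(residuals)  # assumed starting value for the recursion
    for a in residuals:
        var = omega + alpha * a * a + beta * var
    r_hat = mean + phi * centered[-1]
    return r_hat, var


r_hat, sigma2_hat = ar1_garch11_forecast([4.2, 5.9, 7.1, 7.9, 7.4, 6.8])
print(f"p_t(R_t) ~ N({r_hat:.2f}, {sigma2_hat:.2f})")
```

The returned pair plays the role of $(\hat{r}_t, \hat{\sigma}_t^2)$: the variance changes with the recent residuals, which is what makes the inferred density time-dependent.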

Quality of Dynamic Density Metrics

Metric                        $\hat{r}_t$      $\hat{\sigma}_t^2$
ARMA-GARCH                    ARMA             GARCH
Uniform Thresholding (UT)     ARMA             $u$ (user-specified)
Variable Thresholding (VT)    ARMA             sample variance of $S^H_{t-1}$
Kalman-GARCH                  Kalman Filter    GARCH

Problem: The true density $\hat{p}_t(R_t)$ is not observable.

Indirect Method

Suppose $p_1(R_1), \ldots, p_T(R_T)$ are the inferred densities and let $z_t = P(R_t \le r_t)$. Then $z_t$ is uniformly distributed on $(0, 1)$ when $p_t(R_t) = \hat{p}_t(R_t)$ [Diebold et al.].

$$d\{U_Z(z), Q_Z(z)\} = \sqrt{\sum_{x=0}^{1} \bigl(U_Z(x) - Q_Z(x)\bigr)^2}, \qquad (1)$$

where $U_Z(z)$ is the ideal uniform cdf on $(0, 1)$ and $Q_Z(z)$ is the observed cdf of $z_t$. We call $d\{U_Z(z), Q_Z(z)\}$ the density distance.
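The density distance of equation (1) can be sketched in a few lines, assuming Gaussian inferred densities. The function name and the evaluation grid (101 equally spaced points on [0, 1]) are assumptions; the slide does not say at which points x the sum is taken.

```python
import math
from statistics import NormalDist


def density_distance(observed, means, stds, grid_points=101):
    """Distance between the observed cdf of z_t and the uniform cdf on (0, 1)."""
    # Probability integral transform: z_t = P(R_t <= r_t) under the inferred density.
    z = [NormalDist(m, s).cdf(r) for r, m, s in zip(observed, means, stds)]
    total = 0.0
    for i in range(grid_points):
        x = i / (grid_points - 1)                  # grid spacing is an assumption
        q = sum(1 for v in z if v <= x) / len(z)   # observed cdf Q_Z(x)
        total += (x - q) ** 2                      # ideal uniform cdf is U_Z(x) = x
    return math.sqrt(total)


# Observations that always hit the predicted mean give z_t = 0.5 for every t,
# which is far from uniform, so the distance is large.
print(density_distance([4.0, 6.0, 7.0], [4.0, 6.0, 7.0], [0.3, 3.2, 2.9]))
```

A well-calibrated metric spreads the values $z_t$ evenly over (0, 1) and drives the distance toward zero, which is why the distance can be used without ever observing the true density.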

Probabilistic View Generation

CREATE VIEW prob_view AS DENSITY r OVER t
OMEGA delta=2, n=2
FROM raw_values
WHERE t >= 1 AND t <= 3

Framework

[Figure: the sensor sends raw values (for example r = 10.2 at t = 2) to raw_values; the dynamic density metrics infer $p_t(R_t)$ for the values r1, r2, r3, ...; the Ω–View builder, backed by the σ–cache, answers the user's probabilistic view generation query over the ranges Ω and produces prob_view.]

t    r     $\hat{r}$   $\hat{\sigma}$
1    4.2   4.0         0.3
2    5.9   6.0         3.2
3    7.1   7.0         2.9
4    7.9   7.7         0.2

prob_view
t    ω1      probability   ω2      probability
1    [2:4]   0.50          [0:2]   0.01
2    [4:6]   0.23          [2:4]   0.08
3    [5:7]   0.25          [3:5]   0.16

Problem: Large computational cost when the time interval and n are large and ∆ is small (finer granularity).

Idea: Cache and reuse computation of probability values from earlier times.
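To make the cost concrete, here is a minimal sketch of a naive view builder that evaluates the Gaussian cdf for every range of every tuple. It assumes the ranges are $[\hat{r}_t + \lambda\Delta, \hat{r}_t + (\lambda+1)\Delta]$ for $\lambda = -n, \ldots, n-1$; the exact index set used by the OMEGA clause is an assumption, as are the function names.

```python
from statistics import NormalDist


def omega_ranges(r_hat, delta, n):
    """Ranges [r_hat + lam*delta, r_hat + (lam+1)*delta] for lam = -n, ..., n-1."""
    return [(r_hat + lam * delta, r_hat + (lam + 1) * delta) for lam in range(-n, n)]


def naive_view(rows, delta, n):
    """rows: iterable of (t, r_hat, sigma_hat); returns (t, low, high, probability)."""
    view = []
    for t, r_hat, sigma_hat in rows:
        dist = NormalDist(r_hat, sigma_hat)
        for low, high in omega_ranges(r_hat, delta, n):
            # Two cdf evaluations per range, repeated for every tuple.
            view.append((t, low, high, dist.cdf(high) - dist.cdf(low)))
    return view


# (t, r_hat, sigma_hat) taken from the example table on the slide
rows = [(1, 4.0, 0.3), (2, 6.0, 3.2), (3, 7.0, 2.9)]
for t, low, high, p in naive_view(rows, delta=2, n=2):
    print(t, f"[{low:g}:{high:g}]", round(p, 2))
```

The work grows with the number of matching tuples times the number of ranges, which is the cost the caching idea is meant to avoid.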

Constraint-Aware Caching

Given: $p_t(R_t)$ and $p_{t'}(R_{t'})$ are Gaussian with $(\hat{r}_t, \hat{\sigma}_t^2)$ and $(\hat{r}_{t'}, \hat{\sigma}_{t'}^2)$.
Aim: Approximate values of $p_{t'}(R_{t'})$ by $p_t(R_t)$ when $t' > t$.

Distance constraint: when the cache is used, the maximum approximation error is bounded above by a user-specified distance.
Memory constraint: the cache never uses more memory than a user-specified limit.

$\rho_\lambda$ remains unchanged: the densities $P_t(R_t; \hat{r}_t, \hat{\sigma}_t^2)$ and $P_{t'}(R_{t'}; \hat{r}_{t'}, \hat{\sigma}_t^2)$ differ only in their means, so the probability over $[a, b]$ equals the probability over $[a', b']$, where

$a = \hat{r}_t + \lambda\Delta$, $b = \hat{r}_t + (\lambda+1)\Delta$
$a' = \hat{r}_{t'} + \lambda\Delta$, $b' = \hat{r}_{t'} + (\lambda+1)\Delta$

[Figure: two Gaussian curves with the same variance and different means, each with a range of width Δ marked at the same offset from its mean.]
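The statement that $\rho_\lambda$ is unchanged by a shift of the mean can be checked numerically. In this small sketch the function name and the sample values are illustrative assumptions.

```python
from statistics import NormalDist


def range_probability(r_hat, sigma_hat, lam, delta):
    """Probability of N(r_hat, sigma_hat^2) on [r_hat + lam*delta, r_hat + (lam+1)*delta]."""
    dist = NormalDist(r_hat, sigma_hat)
    return dist.cdf(r_hat + (lam + 1) * delta) - dist.cdf(r_hat + lam * delta)


# Same standard deviation, different means: the range probability is identical,
# so a value computed at time t can be reused at time t'.
p_t = range_probability(4.0, 1.5, lam=1, delta=2)
p_t_prime = range_probability(9.0, 1.5, lam=1, delta=2)
print(abs(p_t - p_t_prime) < 1e-12)
```

Because only the standard deviation matters, a cache needs to be keyed by $\hat{\sigma}$ alone rather than by the pair $(\hat{r}, \hat{\sigma})$.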

Guaranteeing Distance Constraint

We use the Hellinger distance, denoted $H[\cdot, \cdot]$, as the distance measure; $0 \le H \le 1$.

Theorem: Distance Constraint

Given a user-defined distance constraint $H_0$, we guarantee that $H[p_t(R_t), p_{t'}(R_{t'})] \le H_0$ if $\hat{\sigma}_{t'} \le d_s \cdot \hat{\sigma}_t$ and $\hat{\sigma}_{t'} > \hat{\sigma}_t$, where the parameter $d_s$ is chosen as any value satisfying

$$d_s \le \frac{1 + \sqrt{1 - (1 - H_0^2)^4}}{(1 - H_0^2)^2}.$$

We call $d_s$ the ratio threshold.

Example

Suppose $H_0 = 0.2$; then $d_s \le 1.5$. Choose, say, $d_s = 1.4$. Then if $\hat{\sigma}_{t'} / \hat{\sigma}_t \le d_s$, we have $H[p_t(R_t), p_{t'}(R_{t'})] \le 0.2$.
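The bound and the worked example can be checked with a short sketch. It uses the standard closed form of the Hellinger distance between two univariate Gaussians; comparing the two distributions with equal means is an assumption based on the observation that range probabilities do not depend on the mean. The function names are illustrative.

```python
import math


def hellinger_gaussian(mu1, sigma1, mu2, sigma2):
    """Hellinger distance between N(mu1, sigma1^2) and N(mu2, sigma2^2)."""
    s = sigma1 ** 2 + sigma2 ** 2
    bc = math.sqrt(2.0 * sigma1 * sigma2 / s) * math.exp(-((mu1 - mu2) ** 2) / (4.0 * s))
    return math.sqrt(max(0.0, 1.0 - bc))


def max_ratio_threshold(h0):
    """Largest ratio threshold d_s allowed for a distance constraint h0 (0 <= h0 < 1)."""
    c = (1.0 - h0 ** 2) ** 2
    return (1.0 + math.sqrt(1.0 - c ** 2)) / c


d_max = max_ratio_threshold(0.2)
print(round(d_max, 4))                               # upper limit on d_s for H0 = 0.2
print(hellinger_gaussian(0.0, 1.0, 0.0, 1.4) <= 0.2)  # d_s = 1.4 respects the constraint
```

At the limit the constraint is tight: two equal-mean Gaussians whose standard deviations differ by exactly the factor returned by `max_ratio_threshold(h0)` are at Hellinger distance `h0`.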

Initializing the σ–cache

Let $\max(\hat{\sigma}_t)$ and $\min(\hat{\sigma}_t)$ be the maximum and minimum standard deviations observed in a probabilistic view generation query.

Compute $Q$ such that $\max(\hat{\sigma}_t) = d_s^{Q} \cdot \min(\hat{\sigma}_t)$.

$\lceil Q \rceil$ gives us the maximum number of distributions that we should cache.

[Figure: cache memory holding cached values for the standard deviations $d_s^{1} \cdot \min(\hat{\sigma}_t)$, $d_s^{2} \cdot \min(\hat{\sigma}_t)$, ..., $d_s^{Q} \cdot \min(\hat{\sigma}_t)$, each stored as n values at granularity Δ.]

Find $d_s^{q} \cdot \min(\hat{\sigma}_t)$ such that $d_s^{q} \cdot \min(\hat{\sigma}_t) \le \hat{\sigma}_{t'} < d_s^{q+1} \cdot \min(\hat{\sigma}_t)$.
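The initialization and lookup steps can be sketched as a small class. This is an illustrative reading of the slide, not the authors' implementation: the class name, the choice to cache the exponents q = 0, ..., ⌈Q⌉ − 1, the clamping of out-of-range lookups, and the range index set λ = −n, ..., n − 1 are all assumptions.

```python
import math
from statistics import NormalDist


class SigmaCache:
    """One row of range probabilities per cached standard deviation."""

    def __init__(self, sigma_min, sigma_max, d_s, delta, n):
        self.sigma_min, self.d_s = sigma_min, d_s
        # Q solves sigma_max = d_s**Q * sigma_min; ceil(Q) bounds the number of rows.
        q_real = math.log(sigma_max / sigma_min) / math.log(d_s)
        self.size = max(1, math.ceil(q_real))
        self.rows = {}
        for q in range(self.size):
            # Zero mean suffices because range probabilities are shift-invariant.
            dist = NormalDist(0.0, d_s ** q * sigma_min)
            self.rows[q] = [dist.cdf((lam + 1) * delta) - dist.cdf(lam * delta)
                            for lam in range(-n, n)]

    def lookup(self, sigma):
        """Row q with d_s**q * sigma_min <= sigma < d_s**(q+1) * sigma_min (clamped)."""
        q = math.floor(math.log(sigma / self.sigma_min) / math.log(self.d_s))
        return self.rows[min(max(q, 0), self.size - 1)]


cache = SigmaCache(sigma_min=0.2, sigma_max=3.2, d_s=1.4, delta=2, n=2)
print(cache.size, [round(p, 2) for p in cache.lookup(2.9)])
```

The number of rows depends only on the ratio of the largest to the smallest standard deviation and on the ratio threshold, not on how many tuples are looked up.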

σ–cache: Features

CREATE VIEW prob_view AS DENSITY r OVER t
OMEGA delta=2, n=2
FROM raw_values
WHERE t >= 1 AND t <= 3

The rate at which the memory requirement grows is $O\!\left(\log \frac{\max(\hat{\sigma}_t)}{\min(\hat{\sigma}_t)}\right)$.

The number of distributions cached does not depend on the number of tuples that match the WHERE clause, nor on ∆ or n.

Experimental Evaluation

campus-data: ambient temperature values for over sixty-five hours
car-data: more than one hour of GPS data

[Figure: density distance against window size (H) for ARMA-GARCH, UT, VT, and Kalman-GARCH; panels (a) campus-data and (b) car-data.]

ARMA-GARCH and Kalman-GARCH give up to 12 to 20 times lower density distances.

Experimental Evaluation

[Figure: comparison of naive and σ-cache; panels (a) Efficiency and (b) Scaling Characteristics. Axes named in the figure: time (milliseconds), cache size (kilobytes), database size (tuples), max. ratio threshold (Ds).]

Using ∆ = 0.05, n = 300 and Hellinger distance H = 0.01.

An order of magnitude improvement in performance!

Conclusions

Proposed time-series based models can be used for creating probabilistic databases.
Introduced the concept of density distance for measuring quality.
Proved useful and practical guarantees for using the σ-cache.
Caching and reusing distributions significantly increases the efficiency of creating probabilistic databases.

Thank You. Questions?

Saket Sathe [email protected]
Hoyoung Jeung [email protected]
Karl Aberer [email protected]
