Creating Probabilistic Databases from Imprecise Time-Series Data
Saket Sathe, Hoyoung Jeung, Karl Aberer
EPFL, Switzerland
13th April, 2011
S. Sathe, H. Jeung, K. Aberer (2011)
EPFL, Switzerland
1 / 15
Outline

Example: a probability distribution p(R) showing Alice's position

raw_values
time  x    y
1     1.1  2.3
2     1.3  2.1
:     :    :

prob_view
time  room  probability
1     1     0.5
1     2     0.1
1     3     0.3
1     4     0.1
2     1     0.2
2     2     0.4
2     3     0.1
2     4     0.3

[Figure: rooms 1 to 4 in the x-y plane at time = 1 and time = 2. The distribution p(R) is centered at μ, with the 3σ area as a reasonable boundary. The probability for room 4 is the integral of p(R) dR over room4 ∩ 3σ area.]

Dynamic Density Metrics
Approximating Gaussian distributions using σ–cache
Measure of Quality
Efficiently creating probabilistic views
Parameter setting under provable guarantees
Experiments
Problem Setting

[Figure: values plotted against time, with the window $S^H_{t-1}$, time ticks at t-H-1, t-1 and t, the values $r_{t-1}$ and $r_t$, and the density $p_t(R_t)$ drawn at time t.]

Dynamic Density Metric
Given $S^H_{t-1}$, the dynamic density metric infers time-dependent probability distributions $p_t(R_t)$ at time $t$, where $R_t$ is a random variable associated with $r_t$.
GARCH Metric

$p_t(R_t) \sim \mathcal{N}(\hat{r}_t, \hat{\sigma}_t^2)$

[Figure: values plotted against time, with the window $S^H_{t-1}$ and time ticks at t-H-1, t-1 and t. The density at time t is centered at $\hat{r}_t$; the points $p_t(R_t = \hat{r}_t)$ and $p_t(R_t = r_t)$ are marked.]

$\hat{r}_t$ is modeled using an ARMA model.
$\hat{\sigma}_t^2$ is modeled using a GARCH model.
Thus $p_t(R_t)$ is $\mathcal{N}(\hat{r}_t, \hat{\sigma}_t^2)$. We refer to this approach as ARMA-GARCH.
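As a rough illustration of how a mean model and a variance model combine into one Gaussian per time step, the sketch below uses the simplest instances of each: an AR(1) mean and a GARCH(1,1) variance with fixed coefficients. The function name, the coefficient values and the initialisation are assumptions made for this example; the slides do not state model orders or how the parameters are estimated.

```python
def infer_densities(values, c=0.0, phi=0.9, alpha0=0.05, alpha1=0.1, beta1=0.8):
    """Return [(mean, variance), ...] for t = 1 .. len(values) - 1.

    Assumed (hypothetical) models, not taken from the slides:
      mean:     r_hat_t  = c + phi * r_{t-1}                 (AR(1))
      variance: sigma2_t = alpha0 + alpha1 * a_{t-1}^2
                           + beta1 * sigma2_{t-1}            (GARCH(1,1))
    where a_{t-1} = r_{t-1} - r_hat_{t-1} is the previous residual.
    """
    # Start from the unconditional GARCH variance and a zero residual.
    variance = alpha0 / (1.0 - alpha1 - beta1)
    residual = 0.0
    densities = []
    for t in range(1, len(values)):
        mean = c + phi * values[t - 1]
        variance = alpha0 + alpha1 * residual ** 2 + beta1 * variance
        densities.append((mean, variance))
        residual = values[t] - mean  # becomes a_{t-1} in the next step
    return densities


densities = infer_densities([1.0, 1.0, 1.0])
```

Each returned pair plays the role of $(\hat{r}_t, \hat{\sigma}_t^2)$ for one time step. A real deployment would fit the coefficients to the window $S^H_{t-1}$ rather than fix them.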
Quality of Dynamic Density Metrics

Metric                      $\hat{r}_t$     $\hat{\sigma}_t^2$
ARMA-GARCH                  ARMA            GARCH
Uniform Thresholding (UT)   ARMA            u (user-specified)
Variable Thresholding (VT)  ARMA            sample variance of $S^H_{t-1}$
Kalman-GARCH                Kalman Filter   GARCH

Problem: The true density $\hat{p}_t(R_t)$ is not observable.
Indirect Method
Suppose $p_1(R_1), \ldots, p_T(R_T)$ are the inferred densities and let $z_t = P(R_t \le r_t)$. Then $z_t$ is uniformly distributed on $(0, 1)$ when $p_t(R_t) = \hat{p}_t(R_t)$ [Diebold et al.].

$$d\{U_Z(z), Q_Z(z)\} = \sqrt{\sum_{x=0}^{1} \left(U_Z(x) - Q_Z(x)\right)^2}, \qquad (1)$$

where $U_Z(z)$ is the ideal uniform cdf on $(0, 1)$ and $Q_Z(z)$ is the observed cdf of $z_t$. We call $d\{U_Z(z), Q_Z(z)\}$ the density distance.
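The density distance can be sketched in a few lines. The version below assumes Gaussian inferred densities and evaluates the sum on an evenly spaced grid over [0, 1]; the grid resolution and the function name are assumptions of this example, since the slide does not say how the sum is discretised.

```python
from statistics import NormalDist


def density_distance(observed, means, stds, grid_points=101):
    """Distance between the observed cdf of z_t and the uniform cdf on [0, 1].

    z_t = P(R_t <= r_t) is evaluated under N(means[t], stds[t]^2).
    The evenly spaced evaluation grid is an assumption of this sketch.
    """
    z = [NormalDist(m, s).cdf(r) for r, m, s in zip(observed, means, stds)]
    total = 0.0
    for k in range(grid_points):
        x = k / (grid_points - 1)
        q = sum(1 for v in z if v <= x) / len(z)  # observed cdf Q_Z(x)
        total += (x - q) ** 2                     # U_Z(x) = x on [0, 1]
    return total ** 0.5
```

When the inferred densities match the data, the values $z_t$ spread evenly over (0, 1) and the distance is close to zero. If every observation sits exactly on the predicted mean while the predicted variance is large, all $z_t$ equal 0.5 and the distance grows.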
Probabilistic View Generation

    CREATE VIEW prob_view AS DENSITY r OVER t
    OMEGA delta=2, n=2
    FROM raw_values
    WHERE t >= 1 AND t <= 3

[Figure: framework. Labels: sensor (r = 10.2 at t = 2), raw_values, dynamic density metrics, $p_t(R_t)$, Ω–View builder, σ–cache, user, probabilistic view generation query, Ω, Λ, prob_view.]

raw_values with inferred parameters
t  r    $\hat{r}$  $\hat{\sigma}$
1  4.2  4.0        0.3
2  5.9  6.0        3.2
3  7.1  7.0        2.9
4  7.9  7.7        0.2

prob_view
t  range  interval  probability
1  ω1     [2:4]     0.50
1  ω2     [0:2]     0.01
2  ω1     [4:6]     0.23
2  ω2     [2:4]     0.08
3  ω1     [5:7]     0.25
3  ω2     [3:5]     0.16
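A naive view builder can be sketched as follows, assuming Gaussian densities and ranges of width Δ placed relative to the predicted mean, with range λ covering $[\hat{r}_t + \lambda\Delta,\ \hat{r}_t + (\lambda+1)\Delta]$. The function name is hypothetical, and the choice `lambdas=[-1, -2]` is an assumption picked only because it reproduces the intervals listed for ω1 and ω2; how n maps to the set of ranges is not spelled out here.

```python
from statistics import NormalDist


def range_probabilities(r_hat, sigma_hat, delta, lambdas):
    """Return (a, b, probability) for each range index in `lambdas`.

    Range lambda covers [r_hat + lambda*delta, r_hat + (lambda+1)*delta].
    Which lambda values a query with a given n produces is left to the
    caller; that enumeration is an open assumption of this sketch.
    """
    dist = NormalDist(r_hat, sigma_hat)
    rows = []
    for lam in lambdas:
        a = r_hat + lam * delta
        b = r_hat + (lam + 1) * delta
        rows.append((a, b, dist.cdf(b) - dist.cdf(a)))
    return rows


# Naive view generation: every tuple recomputes every range from scratch.
tuples = [(1, 4.0, 0.3), (2, 6.0, 3.2), (3, 7.0, 2.9)]  # (t, r_hat, sigma_hat)
view = {t: range_probabilities(m, s, delta=2, lambdas=[-1, -2])
        for t, m, s in tuples}
```

Each range costs two cdf evaluations per tuple, so the work grows with the number of matching tuples times the number of ranges.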
Problem: Large computational cost when time interval and n are large and Δ is small (finer granularity).
Idea: Cache and reuse computation of probability values from earlier times.
Constraint-Aware Caching

Given: $p_t(R_t)$ and $p_{t'}(R_{t'})$ are Gaussian with $(\hat{r}_t, \hat{\sigma}_t^2)$ and $(\hat{r}_{t'}, \hat{\sigma}_{t'}^2)$.
Aim: Approximate values of $p_{t'}(R_{t'})$ by $p_t(R_t)$ when $t' > t$.

Distance constraint guarantees that the maximum approximation error is upper bounded by the distance constraint when the cache is used.
Memory constraint guarantees that the cache does not use more memory than that specified by the memory constraint.

[Figure: two Gaussians with the same variance, $P_t(R_t; \hat{r}_t, \hat{\sigma}_t^2)$ and $P_{t'}(R_{t'}; \hat{r}_{t'}, \hat{\sigma}_t^2)$. A range of width Δ is marked on each: $[a, b]$ with $a = \hat{r}_t + \lambda\Delta$, $b = \hat{r}_t + (\lambda+1)\Delta$, and $[a', b']$ with $a' = \hat{r}_{t'} + \lambda\Delta$, $b' = \hat{r}_{t'} + (\lambda+1)\Delta$. $\rho_\lambda$ remains unchanged.]
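The remark that $\rho_\lambda$ remains unchanged can be made explicit. Assuming $\rho_\lambda$ denotes the probability mass of the range $[a, b]$ (a reading of the figure, not a definition given on the slide) and writing $\Phi$ for the standard normal cdf, standardising both endpoints removes the mean:

```latex
\begin{align*}
\rho_\lambda
  &= P(a \le R_t \le b)
   = \Phi\!\left(\frac{b - \hat{r}_t}{\hat{\sigma}_t}\right)
   - \Phi\!\left(\frac{a - \hat{r}_t}{\hat{\sigma}_t}\right) \\
  &= \Phi\!\left(\frac{(\lambda + 1)\Delta}{\hat{\sigma}_t}\right)
   - \Phi\!\left(\frac{\lambda\Delta}{\hat{\sigma}_t}\right).
\end{align*}
```

The result depends only on $\lambda$, $\Delta$ and $\hat{\sigma}_t$, so a distribution with the same standard deviation and a different mean $\hat{r}_{t'}$ yields the same value on $[a', b']$. Only a change in standard deviation introduces approximation error.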
Guaranteeing Distance Constraint

We use the Hellinger distance, denoted $H[\cdot, \cdot]$, as a distance measure. $0 \le H \le 1$.

Theorem: Distance Constraint
Given a user-defined distance constraint $H_0$, we guarantee that $H[p_t(R_t), p_{t'}(R_{t'})] \le H_0$ if $\hat{\sigma}_{t'} \le d_s \cdot \hat{\sigma}_t$ and $\hat{\sigma}_{t'} > \hat{\sigma}_t$, where the parameter $d_s$ is chosen as any value satisfying

$$d_s \le \frac{1 + \sqrt{1 - \left(1 - H_0^2\right)^4}}{\left(1 - H_0^2\right)^2}.$$

We call $d_s$ the ratio threshold.

Example
Suppose $H_0 = 0.2$; then $d_s \le 1.5$. Choose, say, $d_s = 1.4$. Then if $\hat{\sigma}_{t'} / \hat{\sigma}_t \le d_s$, we have $H[p_t(R_t), p_{t'}(R_{t'})] \le 0.2$.
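The bound and the worked example can be checked numerically. The sketch below assumes the two Gaussians share the same mean, which matches the cache setting where a cached distribution is shifted onto the new mean; the function names are hypothetical, and the closed form used is the standard Hellinger distance between two equal-mean Gaussians.

```python
import math


def max_ratio_threshold(h0):
    """Largest d_s allowed by the bound for a Hellinger constraint h0 in (0, 1)."""
    c = (1.0 - h0 ** 2) ** 2
    return (1.0 + math.sqrt(1.0 - c ** 2)) / c


def hellinger_equal_mean(sigma_a, sigma_b):
    """Hellinger distance between two Gaussians that share the same mean.

    Equal means are an assumption of this sketch.
    """
    bc = math.sqrt(2.0 * sigma_a * sigma_b / (sigma_a ** 2 + sigma_b ** 2))
    return math.sqrt(max(0.0, 1.0 - bc))


bound = max_ratio_threshold(0.2)            # about 1.5, as in the example
distance = hellinger_equal_mean(1.0, 1.4)   # ratio 1.4 is below the bound
```

With a standard deviation ratio equal to `bound`, the distance reaches the constraint 0.2; any smaller ratio, such as 1.4, stays below it.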
Initializing the σ–cache

Let $\max(\hat{\sigma}_t)$ and $\min(\hat{\sigma}_t)$ be the maximum and minimum standard deviations observed in a probabilistic view generation query.
Compute $Q$ such that $\max(\hat{\sigma}_t) = d_s^Q \cdot \min(\hat{\sigma}_t)$.
$\lceil Q \rceil$ gives us the maximum number of distributions that we should cache.

[Figure: cache memory layout. Rows of cached values for the standard deviations $d_s^1 \min(\hat{\sigma}_t)$, $d_s^2 \min(\hat{\sigma}_t)$, ..., $d_s^Q \min(\hat{\sigma}_t)$; labels n and Δ.]

Find $d_s^q \cdot \min(\hat{\sigma}_t)$ such that $d_s^q \cdot \min(\hat{\sigma}_t) \le \hat{\sigma}_{t'} < d_s^{q+1} \cdot \min(\hat{\sigma}_t)$.
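Both steps reduce to logarithms with base $d_s$. The sketch below is one possible reading; the function names are hypothetical, and floating-point rounding for standard deviations that fall exactly on a power of $d_s$ is not handled.

```python
import math


def max_cached_distributions(min_std, max_std, ds):
    """ceil(Q), where max_std = ds**Q * min_std."""
    q_real = math.log(max_std / min_std) / math.log(ds)
    return math.ceil(q_real)


def cache_index(std, min_std, ds):
    """Integer q with ds**q * min_std <= std < ds**(q+1) * min_std."""
    return math.floor(math.log(std / min_std) / math.log(ds))


# Example values: smallest and largest standard deviation 0.2 and 3.2,
# ratio threshold 1.4.
slots = max_cached_distributions(0.2, 3.2, 1.4)
q = cache_index(0.3, 0.2, 1.4)
```

A lookup for a new standard deviation then costs one logarithm instead of a fresh set of cdf evaluations, at the price of the bounded approximation error.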
σ–cache: Features

    CREATE VIEW prob_view AS DENSITY r OVER t
    OMEGA delta=2, n=2
    FROM raw_values
    WHERE t >= 1 AND t <= 3

The rate at which the memory requirement grows is $O\left(\log \frac{\max(\hat{\sigma}_t)}{\min(\hat{\sigma}_t)}\right)$.
The number of distributions cached does not depend on:
    the number of tuples that match the WHERE clause
    Δ or n
Experimental Evaluation

campus-data: ambient temperature values for over sixty-five hours
car-data: more than one hour of GPS data

[Figure: density distance versus window size (H), with H from 30 to 180, for ARMA-GARCH, Kalman-GARCH, UT and VT. Panels: (a) campus-data, (b) car-data.]

ARMA-GARCH and Kalman-GARCH give density distances that are up to 12 to 20 times lower.
Experimental Evaluation

[Figure: naive versus σ-cache. Panels: (a) Efficiency, (b) Scaling Characteristics. Axis labels: time (milliseconds), database size (tuples), cache size (kilobytes), max. ratio threshold (Ds).]

Using Δ = 0.05, n = 300 and Hellinger distance H = 0.01.
An order of magnitude improvement in performance!
Conclusions

Proposed time-series based models can be used for creating probabilistic databases.
Introduced the concept of density distance for measuring quality.
Proved useful and practical guarantees for using the σ-cache.
Caching and reusing distributions significantly increases the efficiency of creating probabilistic databases.
Thank You. Questions?

Saket Sathe
Hoyoung Jeung
Karl Aberer