Ensemble Estimation of Multivariate f-Divergence
Kevin Moon, Alfred O. Hero III
University of Michigan, Department of Electrical Engineering and Computer Science

Introduction

▪ f-divergence measures the difference between distributions
  • Has the form G(f_1, f_2) = \int g\!\left( \frac{f_1(x)}{f_2(x)} \right) f_2(x) \, dx, where f_1 and f_2 are densities and g is a smooth convex function
  • Includes the KL and Rényi-α divergences
  • Important in statistics, machine learning, and information theory
▪ Convergence rates are unknown for most divergence estimators
▪ We derive the MSE convergence rates for kernel density plug-in f-divergence estimators using a k-nn kernel
  • Assume the densities are smooth and two finite populations of i.i.d. samples are available
▪ We obtain an estimator with rate O(1/T) (T = sample size) by applying the theory of optimally weighted ensemble estimation
  • More computationally tractable than competing estimators

Weighted Ensemble Estimation

Let \bar{l} = \{l_1, \ldots, l_L\} be a set of index values and \{\hat{E}_l\}_{l \in \bar{l}} be an ensemble of estimates of E. The weighted ensemble estimator is

  \hat{E}_w = \sum_{l \in \bar{l}} w(l) \hat{E}_l,

where \sum_{l \in \bar{l}} w(l) = 1. Consider the following conditions on \hat{E}_l:

▪ C.1 The bias is given by

  \mathrm{Bias}(\hat{E}_l) = \sum_{i \in J} c_i \psi_i(l) T^{-i/(2d)} + O\!\left( \frac{1}{\sqrt{T}} \right),

where the c_i are constants depending on the density, J is a finite index set with |J| < L, \min J > 0 and \max J \le d, d is the dimension of the data, and the \psi_i(l) are basis functions depending only on the parameter l.

▪ C.2 The variance is given by

  \mathrm{Var}(\hat{E}_l) = c_v \frac{1}{T} + o\!\left( \frac{1}{T} \right).

Theorem 1. [2] Under the conditions C.1 and C.2, there exists a weight vector w_0 such that

  \mathbb{E}\!\left[ (\hat{E}_{w_0} - E)^2 \right] = O\!\left( \frac{1}{T} \right),

where w_0 is the solution to the following convex optimization problem:

  \min_w \|w\|_2 \quad \text{subject to} \quad \sum_{l \in \bar{l}} w(l) = 1, \quad \gamma_w(i) = \sum_{l \in \bar{l}} w(l) \psi_i(l) = 0, \; i \in J.

Application to Divergence Estimation

▪ Setup: X_1, \ldots, X_N, X_{N+1}, \ldots, X_{N+M_2} i.i.d. samples from f_2 and Y_1, \ldots, Y_{M_1} i.i.d. samples from f_1, with T = N + M_2.
▪ Plug-in estimators: the k-nn density estimate is

  \hat{f}_{i,k}(X_j) = \frac{k}{M_i \, \bar{c} \, \rho_{i,k}^d(j)},

where k \le M_i, \bar{c} is the volume of a d-dimensional unit ball, and \rho_{i,k}(j) is the distance to the k-th nearest neighbor of X_j in Y_1, \ldots, Y_{M_1} or X_{N+1}, \ldots, X_{N+M_2} for i = 1, 2, respectively. The divergence estimate is

  \tilde{G}_k = \frac{1}{N} \sum_{i=1}^{N} g\!\left( \frac{\hat{f}_{1,k}(X_i)}{\hat{f}_{2,k}(X_i)} \right).

▪ Assumptions
  1. f_1, f_2, and g are smooth (sufficiently differentiable)
  2. f_1 and f_2 have bounded support
  3. f_1 and f_2 are strictly lower bounded
  4. M_1 = \Theta(M_2) and k = k_0 M_2^{1/2}
▪ MSE convergence rates: under the above assumptions,

Theorem 2. The bias of the plug-in estimator \tilde{G}_k is

  \mathrm{Bias}(\tilde{G}_k) = \sum_{j=1}^{d} O\!\left( \left( \frac{k}{M_2} \right)^{j/d} \right) + O\!\left( \frac{1}{k} \right) + o\!\left( \frac{k}{M_2} + \frac{1}{k} \right).

Theorem 3. The variance of the plug-in estimator \tilde{G}_k is

  \mathrm{Var}(\tilde{G}_k) = O\!\left( \frac{1}{N} + \frac{1}{M_2} \right) + o\!\left( \frac{1}{N} + \frac{1}{M_2} + \frac{1}{k^2} \right).

▪ Weighted ensemble divergence estimator: We apply Theorem 1 to f-divergence estimators to obtain the parametric rate O(1/T). To do this, we verify conditions C.1 and C.2. Choose L > d - 1 and \bar{l} = \{l_1, \ldots, l_L\}. Let k(l) = l \sqrt{M_2} and \tilde{G}_w = \sum_{l \in \bar{l}} w(l) \tilde{G}_{k(l)}. From Theorem 2, the bias of \tilde{G}_{k(l)} satisfies C.1 with \psi_i(l) = l^{i/d} and J = \{1, \ldots, d - 1\}. From Theorem 3, the general form of the variance of \tilde{G}_{k(l)} also follows C.2. Theorem 1 then gives the optimal weight w_0 to obtain a convergence rate of O(1/T).

Proofs of Theorems 2 and 3

▪ Principal tools are concentration inequalities and moment bounds applied to a higher-order Taylor expansion of \tilde{G}_k.
▪ Due to the dependence of \tilde{G}_k on the likelihood ratio, bounds on covariances of products of the density estimates are derived.

Experiments

▪ Truncated Gaussian pdfs with different means and covariances
▪ Estimate the Rényi-α divergence with α = 0.8 for 100 trials
▪ Two experiments: fixed d, increasing sample size; fixed sample size, increasing d

[Figure: Truncated Gaussians, d = 5]
[Figure: Truncated Gaussians, N = 3000]
[Figure: Non-truncated Gaussians, d = 5]
[Figure: Heat map of the predicted bias of the non-averaged f-divergence estimator based on Theorem 2, as a function of dimension and sample size.]

Conclusions

▪ Derived MSE convergence rates for a plug-in estimator of f-divergence
▪ Obtained an estimator with convergence rate O(1/T) by applying the theory of optimally weighted ensemble estimation
  • Simple and performs well in higher dimensions
  • Performs well for densities with unbounded support

Acknowledgments

This work was partially supported by NSF grant CCF-1217880 and an NSF Graduate Research Fellowship to the first author under Grant No. F031543.
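To make the definition of G concrete (an illustrative sketch, not part of the poster): choosing g(t) = t log t reduces G to the KL divergence KL(f1 || f2), which the snippet below evaluates by numeric integration for two 1-D Gaussians and compares against the Gaussian closed form. All function names here are hypothetical.

```python
import numpy as np

def gauss_pdf(x, mu, sigma):
    return np.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * np.sqrt(2 * np.pi))

def f_divergence(g, f1, f2, x):
    # G(f1, f2) = integral of g(f1/f2) * f2, via the trapezoidal rule on grid x
    y = g(f1(x) / f2(x)) * f2(x)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

g_kl = lambda t: t * np.log(t)        # g(t) = t*log(t) gives KL(f1 || f2)
f1 = lambda x: gauss_pdf(x, 0.0, 1.0)
f2 = lambda x: gauss_pdf(x, 1.0, 2.0)

x = np.linspace(-20.0, 20.0, 100001)
kl_numeric = f_divergence(g_kl, f1, f2, x)
# Closed form for Gaussians: log(s2/s1) + (s1^2 + (m1 - m2)^2)/(2 s2^2) - 1/2
kl_exact = np.log(2.0) + (1.0 + 1.0) / (2.0 * 4.0) - 0.5
```

Other choices of the convex function g yield the Rényi-α and other divergences in the same way.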
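A minimal numeric sketch of this k-nn plug-in estimate (illustrative only, not from the poster: brute-force pairwise distances instead of a tree structure, and all names are hypothetical):

```python
import numpy as np
from math import gamma, pi

def knn_density(queries, data, k):
    # k-nn density estimate: f_hat(x) = k / (M * c_bar * rho_k(x)^d), where
    # rho_k(x) is the distance from x to its k-th nearest neighbor in `data`
    M, d = data.shape
    dists = np.sqrt(((queries[:, None, :] - data[None, :, :]) ** 2).sum(axis=-1))
    rho = np.sort(dists, axis=1)[:, k - 1]
    c_bar = pi ** (d / 2) / gamma(d / 2 + 1)  # volume of the unit d-ball
    return k / (M * c_bar * rho ** d)

def plugin_divergence(g, X, Y, X_heldout, k):
    # G_tilde_k = (1/N) * sum_i g( f1_hat(X_i) / f2_hat(X_i) ), with f1_hat
    # built from the sample Y ~ f1 and f2_hat from the held-out sample X_heldout ~ f2
    f1_hat = knn_density(X, Y, k)
    f2_hat = knn_density(X, X_heldout, k)
    return float(np.mean(g(f1_hat / f2_hat)))
```

With g(t) = t log t this approximates KL(f1 || f2); its bias and variance for a fixed k are what Theorems 2 and 3 characterize.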
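One way to compute this optimal weight numerically (a sketch under the assumption ψ_i(l) = l^(i/d) and J = {1, ..., d-1}; the function name is hypothetical): stack the equality constraints as A w = b, and note that for a full-row-rank underdetermined system, `np.linalg.lstsq` returns exactly the minimum-ℓ2-norm solution, i.e., the minimizer of ||w||_2 subject to A w = b.

```python
import numpy as np

def optimal_weights(l_values, d):
    # Equality constraints of Theorem 1 as A w = b:
    #   row 0:          sum_l w(l)           = 1
    #   rows 1..d-1:    sum_l w(l) * l^(i/d) = 0   (psi_i(l) = l^(i/d), i in J)
    l = np.asarray(l_values, dtype=float)
    A = np.vstack([l ** (i / d) for i in range(d)])  # the i = 0 row is all ones
    b = np.zeros(d)
    b[0] = 1.0
    # lstsq on a consistent underdetermined system returns the minimum-norm
    # solution, i.e., argmin ||w||_2 subject to A w = b
    w0, *_ = np.linalg.lstsq(A, b, rcond=None)
    return w0

w0 = optimal_weights(np.arange(1, 11), d=4)  # L = 10 > d - 1, as required
```

The resulting w0 sums to one while zeroing out the l^(i/d) bias terms, which is exactly how the ensemble cancels the slowly decaying bias components.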
References

[1] K. Moon and A. Hero, "Ensemble estimation of multivariate f-divergence," submitted to ISIT 2014.
[2] K. Sricharan, D. Wei, and A. Hero, "Ensemble estimators for multivariate entropy estimation," IEEE Trans. on Info. Theory, vol. 59, no. 7, pp. 4374-4388, 2013.