On Weight Ratio Estimation for Covariate Shift

Ruth Urner, Department of Empirical Inference, MPI for Intelligent Systems, Tübingen, Germany, [email protected]

Shai Ben-David Cheriton School of Computer Science University of Waterloo Waterloo, Canada [email protected]

Abstract

Covariate shift is a common assumption for various Domain Adaptation (DA) tasks. Many of the common DA algorithms for that setup rely on density ratio estimation applied to the source training sample. In this work, we analyze the sample complexity of reliable density ratio estimation and its relation to classification under covariate shift. We provide a strong lower bound on the number of samples needed, even for an extremely simple version of the problem. Aiming to shed light on the practical success of reweighing paradigms, we present a novel reweighing scheme for which we prove finite sample size success guarantees under some natural conditions on the learning problems involved. Notably, our reweighing algorithm for learning under covariate shift does not rely on first approximating the density ratio of the two distributions.

1 Introduction

Sample reweighing (also known as importance weighing or density ratio estimation) is a paradigm that plays a major role in domain adaptation learning under the covariate shift assumption ([1], [2], [3]). Indeed, provided that the covariate shift assumption holds and that the target marginal distribution is absolutely continuous w.r.t. the source marginal, it can be readily shown that reweighing the source labeled sample, so that each sample point gets a weight proportional to the ratio of its target density to its source density, turns it into a sample that can be used as if it were target generated [4]. This approach therefore provides a full solution to the domain adaptation learning problem once the density ratio between the source and target distributions can be reliably evaluated at the sample points. However, such knowledge is rare. Instead, what a learner typically has access to are finite samples generated by those distributions. Consequently, algorithms that employ sample reweighing rely on estimating those ratios from finite source and target samples. Many sample-based reweighing algorithms have been proposed and are repeatedly reported to perform well on domain adaptation and other learning tasks [5, 2, 6, 3]. However, theoretical analysis of such estimation algorithms has not yielded finite sample size approximation guarantees (see also the related work in Appendix Section A).

Our first result is a strong lower bound. We show that even a very simple version of the weight ratio estimation problem already requires unreasonably large sample sizes. More specifically, we show that even distinguishing between the cases that two distributions are identical or very different in terms of their density ratio requires sample sizes on the order of the square root of the domain size.

The main contribution of this work is on the positive side, though. We propose a novel reweighing scheme for learning under covariate shift and show that it enjoys strong performance guarantees. We prove finite sample error bounds for our algorithm under some natural niceness assumptions about the relevant distributions that are likely to hold for many practical learning tasks. The properties that we rely on are a relaxed version of Lipschitzness of the labeling that takes the marginal distribution

into account, and a novel notion of simplicity of the marginal distribution, called concentratedness, which reflects some structure likely to exist in many real data sets and may prove useful for other learning tasks. It is worth noting that our reweighing algorithm does not rely on first approximating the density ratio between the two distributions involved.

Notation. We employ standard notation; see Appendix, Section B, for more details. We let $P^S$ and $P^T$ be two distributions over a labeled domain $X \times \{0,1\}$. We denote their marginal distributions over $X$ by $P^S_X$ and $P^T_X$ respectively. We assume covariate shift, that is, we assume that $P^S$ and $P^T$ have the same labeling function $l : X \to \{0,1\}$. A function $h : X \to \{0,1\}$ is a classifier and a set of classifiers $H \subseteq \{0,1\}^X$ a hypothesis class. We let $\mathrm{err}_P(h) = \Pr_{(x,y)\sim P}[h(x) \neq y]$ denote the error of classifier $h$ with respect to the distribution $P$. The empirical error with respect to a sample $S$ is denoted by $\widehat{\mathrm{err}}_S(h)$. We let the best achievable error by hypothesis class $H$ with respect to distribution $P$ be denoted by $\mathrm{opt}_P(H) = \inf_{h \in H} \mathrm{err}_P(h)$.

2 The difficulty of density ratio estimation

Density ratio estimation has been advocated as a key tool for domain adaptation in the covariate shift setup ([5, 2, 6, 3]). However, note that density ratio estimation is at least as difficult a task as density estimation (just consider the case where one of the densities is a constant function). Furthermore, we derive a strong lower bound even for a particularly simple version of density ratio estimation. Our lower bound implies that this task cannot be reliably achieved with finite, task-independent sample sizes as long as no further assumptions about the densities involved are made. It is shown via a reduction from hardness results for domain adaptation which appeared in [7]. For more details on this reduction and our lower bound, see Appendix, Section C.

Theorem 1. Given a finite set $X$, no algorithm that is based on samples from the two distributions can distinguish between the case that two distributions over $X$ have density ratio 1 for all $x \in X$ and the case where they have density ratio 0 for all $x \in X$, as long as the sizes of those samples sum up to less than $\Omega\left(\sqrt{|X|}\right)$.

3 A new reweighing algorithm

We introduce a DA learning algorithm that is based on a novel reweighing mechanism, ReWeigh(S, T, η, B) (where $S$ is a source sample, $T$ an unlabeled target sample, $\eta$ a parameter, and $B$ a partition of the domain set). Our learning algorithm uses this mechanism to reweigh a labeled sample $S$ from the source according to an (unlabeled) sample $T$ from the target. Then it chooses a weighted ERM classifier among the members of $H$ that satisfy some margin requirement (see the definition below for the notion of margin we use).

Algorithm Learn
Input: Finite sets $S, T \subseteq X$, partition $B$, accuracy $\epsilon$, margin $\rho$
Step 1: Get weights for $S$ by ReWeigh(S, T, 3ε/4, B)
Step 2: $\hat{H} = \{h \in H : h \text{ has a } (\rho, 0) \text{ margin w.r.t. } w(S)\}$
Output: $\mathrm{ERM}_{\hat{H}}(w(S))$

Empirical Risk Minimization (ERM) can be readily generalized to a weighted sample $S$ as follows: Let $w(S) = ((x_1, y_1, w_1), \ldots, (x_n, y_n, w_n))$ be a sample where $w_i$ is a weight associated with example $(x_i, y_i)$. Then we define the weighted empirical error with respect to the weighted sample as
$$\widehat{\mathrm{err}}_{w(S)}(h) = \left(\sum_{(x_i,y_i,w_i)\in S} w_i\right)^{-1}\left(\sum_{(x_i,y_i,w_i)\in S} w_i\,\mathbb{1}[h(x_i)\neq y_i]\right),$$
and the weighted ERM is defined as $\mathrm{ERM}_H(w(S)) \in \operatorname{argmin}_{h\in H}\widehat{\mathrm{err}}_{w(S)}(h)$.
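As a minimal illustration (ours, not code from the paper), the weighted empirical error and a weighted ERM over a finite pool of candidate classifiers can be computed as follows; the helper names `weighted_error` and `weighted_erm` are hypothetical.

```python
def weighted_error(weighted_sample, h):
    """Weighted empirical error of classifier h on a sample of (x, y, w) triples."""
    total = sum(w for _, _, w in weighted_sample)
    mistakes = sum(w for x, y, w in weighted_sample if h(x) != y)
    return mistakes / total if total > 0 else 0.0

def weighted_erm(weighted_sample, hypotheses):
    """Weighted ERM over a (non-empty) finite pool of candidate classifiers."""
    return min(hypotheses, key=lambda h: weighted_error(weighted_sample, h))
```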

For our reweighing scheme, assume that $B$ is a partition of the space $X$ into sets of diameter at most $\rho$. For (sample) points $s$ and $t$ we let $b_s \in B$ and $b_t \in B$ denote the cells of the partition that contain $s$ and $t$ respectively. Given samples $S$ and $T$, a parameter $\eta$ and partition $B$, our reweighing procedure focuses on cells that are $\eta$-heavy from the point of view of the sample $T$; we let $T_\eta$ be the subset of all elements of $T$ that reside in such $\eta$-heavy cells. Let $w_{(B,T,\eta)}(S)$ be the reweighing of $S$ that assigns to every $s \in S$ the weight
$$w(s) = \frac{|T \cap b_s|}{|T|\,|S \cap b_s|} \quad \text{if } \frac{|T \cap b_s|}{|T|} \ge \eta, \text{ and } w(s) = 0 \text{ otherwise.}$$
Namely, this procedure uniformly divides the empirical $T$-weight of every cell among all its $S$-members, after $T$ is restricted to the $\eta$-heavy cells that are occupied by members of $S$.

Algorithm ReWeigh
Input: Finite sets $S, T \subseteq X$, parameter $\eta$, partition $B$
Step 1: For all $s \in S$ set
$$w(s) = \begin{cases} \dfrac{|T \cap b_s|}{|T|\,|S \cap b_s|} & \text{if } \dfrac{|T \cap b_s|}{|T|} \ge \eta \\[4pt] 0 & \text{otherwise} \end{cases}$$
Output: The weighted set $w_{(B,T,\eta)}(S) = w(S) = \{(s, w(s)) : s \in S\}$
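To make the procedure concrete, the following Python sketch implements the ReWeigh rule above for a finite partition. The function `cell_of`, which maps a point to an identifier of its cell in $B$, is an assumption of this illustration and not part of the paper.

```python
from collections import Counter

def reweigh(S, T, eta, cell_of):
    """Sketch of ReWeigh: S is a list of (x, y) source examples, T a list of
    unlabeled target points, eta the heaviness threshold, and cell_of(x) maps
    a point to the identifier of its partition cell in B."""
    t_counts = Counter(cell_of(x) for x in T)           # |T ∩ b| for every cell b
    s_counts = Counter(cell_of(x) for x, _ in S)        # |S ∩ b| for every cell b
    weighted = []
    for x, y in S:
        b = cell_of(x)
        if t_counts[b] / len(T) >= eta:                 # keep only eta-heavy cells
            w = t_counts[b] / (len(T) * s_counts[b])    # split the cell's T-mass over its S-points
        else:
            w = 0.0
        weighted.append((x, y, w))
    return weighted
```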

4 Distributional assumptions

In this section, we discuss the assumptions on the data generation underlying our analysis. Note that, by our lower bound, no positive results are possible in a distribution-free framework, i.e. without such assumptions. Our work can therefore also be viewed as identifying a first set of properties of the data generation that allow for domain adaptation learning based on sample reweighing.

Weight ratio. A common way of restricting the divergence of source and target marginal distributions is to assume some non-zero lower bound on the density ratio between the two distributions. The strongest such assumption (which is nevertheless often employed) is a bound on the pointwise weight ratio. However, this is rather unrealistic [1]. The following relaxation of a density ratio, the $\eta$-weight ratio, has been introduced in [8].

Definition (Weight ratio). Let $B \subseteq 2^X$ be a collection of subsets of the domain $X$ measurable with respect to both $P^S_X$ and $P^T_X$. For some $\eta > 0$ we define the $\eta$-weight ratio of the source distribution and the target distribution with respect to $B$ as
$$C_{B,\eta}(P^S_X, P^T_X) = \inf_{\substack{b \in B \\ P^T_X(b) \ge \eta}} \frac{P^S_X(b)}{P^T_X(b)}.$$
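As a small illustrative sketch (ours, not from the paper), the $\eta$-weight ratio can be computed directly when the two marginals are given explicitly as cell masses over a partition $B$; the dictionaries `p_source` and `p_target` are assumptions of the example.

```python
def eta_weight_ratio(p_source, p_target, eta):
    """Sketch of C_{B,eta}: p_source and p_target map each cell b in B to its
    probability mass under the source and target marginals respectively."""
    ratios = [p_source.get(b, 0.0) / p_t
              for b, p_t in p_target.items() if p_t >= eta]
    return min(ratios) if ratios else float("inf")   # infimum over the eta-heavy cells
```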

The $\eta$-weight ratio becomes relevant for domain adaptation when it is bounded away from zero. Note that the pointwise weight ratio mentioned above can be obtained by setting $B = \{\{x\} : x \in X\}$.

Distribution concentratedness. We introduce an abstract notion of niceness of probability distributions that measures how "concentrated" the distribution is. Similar to notions of intrinsic dimension, it reflects the intuition of the distribution not occupying the full high-dimensional space (or unit cube, or ball) over which it is defined. However, while common notions of intrinsic dimension focus on having only a few degrees of freedom in the distribution's support, or on being coverable by a small number of balls (as a function of their radius), we consider a different aspect: being coverable by dense subsets. Concentratedness states that most of the domain (with respect to a distribution) can be covered by relatively heavy sets of bounded diameter. That is, intuitively, the distribution cannot "spread very thinly over a large area". For example, a mixture of well separated, full-dimensional Gaussians is a distribution that has high dimension when viewed from the perspective of common intrinsic dimensions, but is simple with respect to our notion.

Definition (Concentratedness). Let $X$ be a subset of some Euclidean space, say $[0,1]^d$, and $B \subseteq 2^X$ a collection of subsets of the domain, usually of some small diameter, that covers $X$. We say that a probability distribution $P$ over $X$ is $(\alpha, \beta)$-concentrated with respect to $B$ if
$$P\left[\bigcup\{b \in B : P(b) \ge \alpha\}\right] > 1 - \beta.$$
When $B$ is the set of all balls of radius $\rho$, we say that $P$ is $(\rho, \alpha, \beta)$-concentrated. Note that we do not care how many sets from $B$ are required for such a majority cover. Our notion becomes particularly useful for classification prediction when paired with a Lipschitzness assumption governing the labeling rule; this is the way we will be applying it here.
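As a small illustration (ours), when $B$ is a partition (so that cell masses add up) and the distribution is given as a dictionary mapping cells to probabilities, $(\alpha, \beta)$-concentratedness can be checked directly:

```python
def is_concentrated(p, alpha, beta):
    """Sketch: check (alpha, beta)-concentratedness w.r.t. a partition B,
    where p maps each cell b in B to its probability mass P(b)."""
    covered = sum(mass for mass in p.values() if mass >= alpha)  # mass of the alpha-heavy cells
    return covered > 1 - beta
```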

Probabilistic Lipschitzness. Probabilistic Lipschitzness (PL) is a relaxation of standard Lipschitzness, introduced in [9]. Loosely speaking, for PL we require Lipschitzness to only hold with some (high) probability. A Lipschitz constant $\lambda$ for a distribution with a deterministic labeling function forces a $1/\lambda$ gap between differently labeled points. Thus, the standard Lipschitz condition for deterministic labeling functions implies that the data lies in label-homogeneous regions (clusters) that are separated by $1/\lambda$-margins of weight zero with respect to the distribution. PL weakens this assumption by allowing the margins to "smoothen out". The relaxation from Lipschitzness to Probabilistic Lipschitzness is thus especially relevant to the deterministic labeling regime: it allows modeling the marginal-label relatedness without trivializing the setup.

Definition (Probabilistic Lipschitzness). Let $X$ be some Euclidean domain and let $\phi : \mathbb{R} \to [0,1]$. We say that $f : X \to \mathbb{R}$ is $\phi$-Lipschitz with respect to a distribution $P_X$ over $X$ if, for all $\lambda > 0$:
$$\Pr_{x \sim P_X}\left[\ \Pr_{y \sim P_X}\left[\,|f(x) - f(y)| > \tfrac{1}{\lambda}\|x - y\|\,\right] > 0\ \right] \le \phi(\lambda).$$

If, for a distribution $P = (P_X, l)$, the labeling function $l$ is $\phi$-Lipschitz, then we also say that $P$ satisfies $\phi$-Probabilistic Lipschitzness. We then let $\phi^{-1}(\epsilon)$ denote the smallest $\lambda$ such that $\phi(\lambda) \ge \epsilon$.

Classifiers with margins. If a distribution $P = (P_X, l)$ with a deterministic labeling function $l$ is $\phi$-Lipschitz, then the weight of points $x$ that have a label-heterogeneous $\lambda$-ball around them is bounded by $\phi(\lambda)$. In other words, $l$ has a $(\lambda/2, \phi(\lambda))$ margin with the following notion of margin:

Definition ($(\gamma, \epsilon)$ margin classifier). Given a probability distribution $P$ over $X$ and parameters $\gamma, \epsilon \in (0,1)$, we say that a classifier $h : X \to \{0,1\}$ has margin $(\gamma, \epsilon)$ with respect to $P$ if
$$\Pr_{x \sim P_X}\left[\{x \in X : \exists y \in X \text{ such that } \|x - y\| \le 2\gamma \text{ and } h(x) \neq h(y)\}\right] < \epsilon.$$
For a class $H$, let $H^P_{\gamma,\epsilon}$ denote the class of all $h$ in $H$ that have a $(\gamma, \epsilon)$ margin with respect to $P$.

5 Finite sample error bound

We prove that under the above assumptions about the distributions in a class of pairs of distributions $W$, our proposed algorithm is a successful DA learner for $W$. See Appendix, Section D, for details on the proof.

Theorem 2. Let $X$ be some domain and let $H$ be a class over $X$ of finite VC dimension. Let $\phi : [0,1] \to [0,1]$. For $\epsilon > 0$, let $\rho = \phi^{-1}(\epsilon/2)$ and let $B$ be a partition of $X$ into disjoint subsets, each of diameter at most $\rho$. Further, for some constant $C$, let $W = W(H, \epsilon, \rho, C)$ be the class of all pairs of probability distributions $(P^S, P^T)$ satisfying $\phi$-Probabilistic Lipschitzness and the covariate shift assumption, as well as
• (Concentration of the marginal) $P^T_X\left[\bigcup\{b \in B : P^T_X(b) \ge \epsilon/2\}\right] > 1 - \epsilon$.
• (Margin w.r.t. $H$) There exists some $h^* \in H$ that has a $(2\rho, \epsilon/2)$-margin w.r.t. $P^T_X$ and $\mathrm{err}_{P^T}(h^*) \le \mathrm{opt}_{P^T}(H) + \epsilon$.
• (Weight-ratio for cells) $C_{B,\epsilon/4}(P^S_X, P^T_X) \ge C$.
Let $\eta = \frac{3\epsilon}{4}$. Then there exist constants $C_1, C_2$ such that $\mathrm{ERM}_{\hat{H}}(w_{(B,T,\eta)}(S))$ is a $3\epsilon$-successful DA learner for $W$, for sample sizes
$$|S| \ge C_1\,\frac{\mathrm{VC}(H\Delta H)\log(d/C\epsilon) + \log(1/\delta)}{C\epsilon}$$
and
$$|T| \ge C_2\,\frac{\mathrm{VC}(H\Delta H) + \log(1/\delta)}{\epsilon^4},$$
where $H\Delta H = \{h_1\Delta h_2 \mid h_1, h_2 \in H\}$ and $h_1\Delta h_2 = \{x \in X \mid h_1(x) \neq h_2(x)\}$.
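To tie the pieces together, here is a sketch of the learner analyzed in Theorem 2, reusing the `reweigh` and `weighted_erm` sketches above; the predicate `has_margin`, standing in for the $(\rho, 0)$-margin check, is a hypothetical helper and not the paper's code.

```python
def learn(S, T, eps, rho, cell_of, hypotheses, has_margin):
    """Sketch of the Learn procedure from Section 3.
    has_margin(h, weighted_sample, rho) is assumed to decide whether h has a
    (rho, 0) margin with respect to the reweighed sample."""
    w_S = reweigh(S, T, 3 * eps / 4, cell_of)        # Step 1: ReWeigh(S, T, 3*eps/4, B)
    H_hat = [h for h in hypotheses                   # Step 2: keep margin-satisfying hypotheses
             if has_margin(h, w_S, rho)]
    # Under the margin assumption of Theorem 2, H_hat is non-empty (it contains h*).
    return weighted_erm(w_S, H_hat)                  # Output: weighted ERM over H_hat
```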

References

[1] Corinna Cortes, Yishay Mansour, and Mehryar Mohri. Learning bounds for importance weighting. In Advances in Neural Information Processing Systems (NIPS), pages 442–450, 2010.
[2] M. Sugiyama, M. Krauledat, and K.-R. Müller. Covariate shift adaptation by importance weighted cross validation. Journal of Machine Learning Research, 8:985–1005, 2007.
[3] Y. Tsuboi, H. Kashima, S. Hido, S. Bickel, and M. Sugiyama. Direct density ratio estimation for large-scale covariate shift adaptation. Journal of Information Processing, 17:138–155, 2009.
[4] Corinna Cortes, Mehryar Mohri, Michael Riley, and Afshin Rostamizadeh. Sample selection bias correction theory. In Proceedings of the Conference on Algorithmic Learning Theory (ALT), pages 38–53, 2008.
[5] M. Sugiyama and K.-R. Müller. Generalization error estimation under covariate shift. In Workshop on Information-Based Induction Sciences, 2005.
[6] Masashi Sugiyama, Taiji Suzuki, Shinichi Nakajima, Hisashi Kashima, Paul von Bünau, and Motoaki Kawanabe. Direct importance estimation for covariate shift adaptation. Annals of the Institute of Statistical Mathematics, 60(4):699–746, 2008.
[7] Shai Ben-David and Ruth Urner. On the hardness of domain adaptation and the utility of unlabeled target samples. In Proceedings of the Conference on Algorithmic Learning Theory (ALT), pages 139–153, 2012.
[8] Shai Ben-David, Shai Shalev-Shwartz, and Ruth Urner. Domain adaptation–can quantity compensate for quality? In International Symposium on Artificial Intelligence and Mathematics (ISAIM), 2012.
[9] Ruth Urner, Shai Ben-David, and Shai Shalev-Shwartz. Unlabeled data can speed up prediction time. In Proceedings of the International Conference on Machine Learning (ICML), pages 641–648, 2011.
[10] Takafumi Kanamori and Masashi Sugiyama. Statistical analysis of distance estimators with density differences and density ratios. Entropy, 16(2):921–942, 2014.
[11] Qichao Que and Mikhail Belkin. Inverse density as an inverse problem: The Fredholm equation approach. In Advances in Neural Information Processing Systems (NIPS), pages 1484–1492, 2013.
[12] Assaf Glazer, Michael Lindenbaum, and Shaul Markovitch. Learning high-density regions for a generalized Kolmogorov-Smirnov test in high-dimensional data. In Advances in Neural Information Processing Systems (NIPS), pages 737–745, 2012.
[13] Benjamin G. Kelly, Thitidej Tularak, Aaron B. Wagner, and Pramod Viswanath. Universal hypothesis testing in the learning-limited regime. In IEEE International Symposium on Information Theory (ISIT), pages 1478–1482, 2010.
[14] D. Haussler and E. Welzl. Epsilon-nets and simplex range queries. In Proceedings of the Second Annual Symposium on Computational Geometry (SCG), pages 61–71, 1986.
[15] Shai Shalev-Shwartz and Shai Ben-David. Understanding Machine Learning: From Theory to Algorithms. Cambridge University Press, 2014.
[16] Vladimir N. Vapnik and Alexey J. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability & Its Applications, 16(2):264–280, 1971.


A Related work

Due to the wide range of problems to which sample-based importance reweighing is relevant, there is a large body of previous work addressing this topic. We aim to refer to some of the most relevant works below. However, there are some high-level differences between the previously published papers that we are aware of and our work, and we believe the particular works we address below are representative of other papers on the topic as well. The first point to emphasize is that in this paper we are focusing on finite sample size guarantees. Furthermore, we are interested in sample size guarantees that are distribution free, or at least hold for a large non-parameterized family of distributions, and are independent of the size of the domain (or distribution support).

Naturally, much of the work on this topic aims to provide practical algorithms supported by experimental results on some concrete data, rather than to provide performance guarantees. This is, for example, the case with [5, 2, 6, 3]. Another relevant line of work aims to estimate the function $f(x) = p(x)/q(x)$ (where $p$ and $q$ are the density functions of two distributions). Examples of such works are [10] and [11]. The difference between this line of work and ours is that it focuses on average norms, $\ell_1$ and $\ell_2$, of the difference between that function and its estimate by the algorithm. This measure is too crude for implying domain adaptation error guarantees as in the setting discussed here. [4] discuss the effect of bias in the sample weighting estimation on the accuracy of hypotheses returned by a learning algorithm that is based on those reweighed samples. However, the accuracy guarantees (Theorem 2 in that work) grow unboundedly with the parameter $B = \max_{x \in S}(1/p(x), 1/\hat{p}(x))$, which, in turn, grows unboundedly with the domain size. In contrast, our focus is on error guarantees that are independent of the domain size (and thus also apply to probability distributions with infinite support).

Lower bounds for domain adaptation have appeared in [7]. In fact, our lower bound for weight ratio estimation is based on a reduction from the latter lower bound for domain adaptation. Another interesting work is [12], focusing on the closely related two-sample problem. However, the theoretical bounds that they provide (Theorem 2 there) are only asymptotic in the sample sizes.

B Notation

We use the following notation: We let $X$ be some domain set and $\{0,1\}$ denote the label set. Let $P^S$ and $P^T$ be two distributions over $X \times \{0,1\}$. We call $P^S$ the source distribution and $P^T$ the target distribution. We denote the marginal distribution of $P^S$ over $X$ by $P^S_X$ and the marginal of $P^T$ by $P^T_X$, and their labeling functions by $l^S : X \to \{0,1\}$ and $l^T : X \to \{0,1\}$, respectively (where, for a probability distribution $P$ over $X \times \{0,1\}$, the associated labeling function is the conditional probability of label 1 at any given point: $l(x) = \Pr_{(X,Y)\sim P}(Y = 1 \mid X = x)$). A function $h : X \to \{0,1\}$ is a classifier and a set of classifiers $H \subseteq \{0,1\}^X$ a hypothesis class. In this work, we analyze learning with respect to the binary loss (0/1-loss). We let $\mathrm{err}_P(h) = \Pr_{(x,y)\sim P}[h(x) \neq y]$ denote the error of classifier $h$ with respect to the distribution $P$. An empirical error with respect to a sample $S$ is denoted by $\widehat{\mathrm{err}}_S(h)$. We let the best achievable error by hypothesis class $H$ with respect to distribution $P$ be denoted by $\mathrm{opt}_P(H) = \inf_{h\in H}\mathrm{err}_P(h)$.

Empirical Risk Minimization. An Empirical Risk Minimizer is a classifier in $H$ that minimizes the empirical error on a sample $S$:
$$\mathrm{ERM}_H(S) \in \operatorname{argmin}_{h\in H}\widehat{\mathrm{err}}_S(h).$$
It is straightforward to generalize this to a weighted sample $S$. Let $S = ((x_1, y_1, w_1), \ldots, (x_n, y_n, w_n))$ be a sample where $w_i$ is a weight associated with example $(x_i, y_i)$. Then we define the weighted empirical error with respect to the weighted sample as
$$\widehat{\mathrm{err}}_{w(S)}(h) = \left(\sum_{(x_i,y_i,w_i)\in S} w_i\right)^{-1}\left(\sum_{(x_i,y_i,w_i)\in S} w_i\,\mathbb{1}[h(x_i)\neq y_i]\right),$$
and $\mathrm{ERM}_H(S)$ is defined as above.

Domain Adaptation learnability. A Domain Adaptation learner (DA learner) takes as input a labeled i.i.d. sample $S$ drawn according to $P^S$ and an unlabeled i.i.d. sample $T$ drawn according to $P^T_X$ and aims to generate a good label predictor $h : X \to \{0,1\}$ for $P^T$. Formally, a DA learner is a function
$$A : \bigcup_{m=1}^{\infty}\bigcup_{n=1}^{\infty}\left((X \times \{0,1\})^m \times X^n\right) \to \{0,1\}^X.$$

Clearly, successful domain adaptation learning cannot be achieved for every source-target pair of learning tasks. Therefore, we state the definition of successful learning in relation to a restricted class of pairs of distributions.

Definition (DA Learnability). Let $X$ be some domain, $W$ a class of pairs $(P^S, P^T)$ of distributions over $X \times \{0,1\}$, $H \subseteq \{0,1\}^X$ a hypothesis class and $A$ a DA learner. We say that $A$ solves DA for $H$ with respect to the class $W$ if there exist functions $m : (0,1)\times(0,1) \to \mathbb{N}$ and $n : (0,1)\times(0,1) \to \mathbb{N}$ such that, for all pairs $(P^S, P^T) \in W$, for all $\epsilon > 0$ and $\delta > 0$, when given access to a labeled sample $S$ of size $m(\epsilon,\delta)$, generated i.i.d. by $P^S$, and an unlabeled sample $T$ of size $n(\epsilon,\delta)$, generated i.i.d. by $P^T_X$, then, with probability at least $1-\delta$ (over the choice of the samples $S$ and $T$), $A$ outputs a function $h$ with $\mathrm{err}_{P^T}(h) \le \mathrm{opt}_{P^T}(H) + \epsilon$. For $s \ge m(\epsilon,\delta)$ and $t \ge n(\epsilon,\delta)$, we also say that the learner $A$ $(\epsilon, \delta, s, t)$-solves DA for $H$ with respect to the class $W$.

We are interested in finding pairs of functions $m$ and $n$, for the labeled source and unlabeled target sample sizes respectively, that satisfy the definition of DA learnability for some DA learner $A$.

Covariate shift. The first property we introduce is often assumed in domain adaptation analysis (for example by [5]). In this work, we assume this property throughout.

Definition (Covariate shift). We say that source and target distribution satisfy the covariate shift property if they have the same labeling function, namely, if $l^S(x) = l^T(x)$ for all $x \in X$. We then denote this common labeling function of $P^S$ and $P^T$ by $l$.

The covariate shift assumption is realistic for many DA tasks. For example, it is reasonable in many natural language processing (NLP) learning problems, such as part-of-speech tagging, where a learner that trains on documents from one domain is applied to a different domain. For such tasks, it is reasonable to assume that the difference between the two tasks lies only in their marginal distributions over English words rather than in the tagging of each word (an adjective is an adjective independently of the type of text it occurs in). While, at first thought, it may seem that DA becomes easy under this assumption, the lower bound in [7] implies that DA remains a very hard learning problem even under covariate shift.

C Details on lower bound

Density ratio estimation has been advocated as a key tool for domain adaptation in the covariate shift setup ([5, 2, 6, 3]). However, note that density ratio estimation is at least as difficult a task as density estimation (just consider the case where one of the densities is a constant function). Furthermore, we now present a strong lower bound even for a particularly simple version of density ratio estimation. Our lower bound implies that this task cannot be reliably achieved with finite, task-independent sample sizes as long as no further assumptions about the densities involved are made.

Formally, we reduce a certain statistical task, the Left/Right problem (see the definition in Section C.1 below), to an extremely simple case of density ratio estimation. The Left/Right problem has been shown to require large sample sizes (on the order of the square root of the domain size) to be reliably solved [7]. Thus, the lower bound on the sample complexity of the Left/Right problem in Lemma 1 implies that this particular case of density ratio estimation is hard. This, in turn, yields a lower bound for the general problem of density ratio estimation, assuming that any "reasonable" notion of successfully estimating density ratios includes, in particular, solving the simple case of Theorem 1. That is, if an algorithm is claimed to be able to estimate density ratios, then it should be able to successfully distinguish between the cases of constant density ratio either 0 or 1.

C.1 The Left/Right problem

The Left/Right problem was introduced in [13] and has been used in [7] to obtain lower bounds for domain adaptation learning under covariate shift.

Definition (Left/Right problem).
Input: Three finite samples, $L$, $R$ and $M$, of points from some domain set $X$.
Output: Assuming that $L$ is an i.i.d. sample from some distribution $P$ over $X$, that $R$ is an i.i.d. sample from some distribution $Q$ over $X$, and that $M$ is an i.i.d. sample generated by one of these two distributions, was $M$ generated by $P$ or by $Q$?

Intuitively, as long as the sample $M$ does not intersect either of the samples $L$ and $R$, no algorithm can successfully solve the Left/Right problem. This yields a lower bound on the required sample sizes that is of the order of the sizes one needs to guarantee a "collision" (hitting some point more than once). It is well known, e.g. from the "birthday paradox", that this requires samples on the order of the square root of the domain size.

Definition. We say that a (possibly randomized) algorithm $A$ $(\delta, l, r, m)$-solves the Left/Right problem over a class $W$ of pairs of probability distributions if, for every $(P_1, P_2) \in W$, given samples $L \sim P_1^l$, $R \sim P_2^r$ and $M \sim P_b^m$, where $b \in \{1,2\}$, $A(L, R, M) = b$ with probability at least $1-\delta$ (over the choice of the samples $L$, $R$, $M$ and possibly also over the internal randomization of the algorithm $A$).

The following lower bound for the Left/Right problem was formally shown in [7]. We consider the classes $W_n^{\mathrm{halves}} = \{(U_A, U_B) : A \cup B = \{1,\ldots,n\},\ A \cap B = \emptyset,\ |A| = |B|\}$ (recall that, for a finite set $Y$, $U_Y$ denotes the uniform distribution over $Y$).

Lemma 1 ([7]). For any given sample sizes $l$ for $L$, $r$ for $R$ and $m$ for $M$, and any $0 < \gamma < 1/2$, if $k = \max\{l, r\} + m$, then for $n > \max\{k^2/\ln(2),\ k^2/\ln(1/2\gamma)\}$ no algorithm has probability of success greater than $1 - \gamma$ over the class $W_n^{\mathrm{halves}}$.
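To illustrate the birthday-paradox intuition behind Lemma 1, the following small simulation (ours, not part of the paper) estimates how often the sample $M$ shares any point with $L$ or $R$; without such a collision, $M$ carries no information about which of the two halves generated it.

```python
import random

def left_right_collision_rate(n, sample_size, trials=1000):
    """Estimate how often M collides with L or R in the Left/Right setup."""
    A, B = range(n // 2), range(n // 2, n)      # one fixed pair (U_A, U_B); by symmetry this is representative
    hits = 0
    for _ in range(trials):
        L = random.choices(A, k=sample_size)    # i.i.d. sample from U_A
        R = random.choices(B, k=sample_size)    # i.i.d. sample from U_B
        M = random.choices(random.choice([A, B]), k=sample_size)
        if set(M) & (set(L) | set(R)):
            hits += 1
    return hits / trials

# With sample sizes well below sqrt(n), collisions are rare, e.g.
# left_right_collision_rate(10**6, 100) typically returns only a few percent.
```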

C.2 Reducing the Left/Right problem to weight ratio estimation

Definition (The extreme two sample problem (ETSP)). Given some finite domain set $X$, let $W^{\mathrm{DRE}} = \{(U_A, U_B) : A, B \text{ are subsets of } X,\ |A| = |B| = |X|/2,\ \text{and either } A = B \text{ or } A \cap B = \emptyset\}$. We say that a (possibly randomized) algorithm $A$ $(\delta, l, r)$-solves the Extreme Density Ratio Estimation problem if, for every $(U_A, U_B) \in W^{\mathrm{DRE}}$, given samples $L \sim U_A^l$ and $R \sim U_B^r$, $A(L, R) = 1$ if $A = B$ and $A(L, R) = 0$ if $A \cap B = \emptyset$, with probability at least $1-\delta$ (over the choice of the samples $L$, $R$ and possibly also over the internal randomization of the algorithm $A$).

Lemma 2. The Extreme Density Ratio Estimation problem is as hard as the Left/Right problem.

Proof. Let $A$ be an algorithm that solves the ETSP. Define an algorithm $A'$ that solves the Left/Right problem for $W_{|X|}^{\mathrm{halves}}$, using $A$ as a subroutine, with the same sample complexity: when given a triplet of samples $(L, R, M)$ from a set $X$, the algorithm $A'$ simply applies the algorithm $A$ to the pair of samples $(L, M)$.

This implies the lower bound on the simple case of density ratio estimation as stated in Theorem 1.

D Proof of Theorem 2

Proof. Let $\epsilon < 1/4$ be given. First, note that with the sample sizes stated, according to Corollary 1, we can assume that $S$ is an $\epsilon$-net with respect to $P^T$ for the set $H\Delta H$ and for the set of cells (note that a partition always has VC-dimension 1). Similarly, we can assume that $T$ is an $\epsilon^2$-approximation for those same collections of sets (see Appendix E and F respectively). Note that $\epsilon^2 < \epsilon/4$.

Let $B_{ST}$ denote the union of all cells that contain weighted points of $w(S)$ (that is, the union of all cells the algorithm "keeps"). Further, let $B_\epsilon$ denote the union of all $\epsilon$-heavy cells (that is, cells that have $P^T$-weight at least $\epsilon$). Since $S$ is an $\epsilon$-net, $S$ hits every heavy cell. Since $T$ is an $\epsilon^2$-approximation of the cells, and $\epsilon^2 < \epsilon/4$, for every $\epsilon$-heavy cell $b$ we have $\frac{|T\cap b|}{|T|} > \frac{3\epsilon}{4}$. Thus we "catch" every $\epsilon$-heavy cell, that is, $B_\epsilon \subseteq B_{ST}$. Now, the concentratedness assumption implies
$$P^T_X(B_{ST}) \ge P^T_X(B_\epsilon) \ge 1 - \epsilon. \tag{1}$$

Now, since $T$ is an $\epsilon^2$-approximation, $\frac{|T\cap b|}{|T|} > \frac{3\epsilon}{4}$ implies that $P^T_X(b) > \epsilon/2$ for every cell $b$ contained in $B_{ST}$. Since $\rho = \phi^{-1}(\epsilon/2)$, the $\rho$-margin around the labeling function $l$ has total $P^T$-weight at most $\epsilon/2$ and can therefore not contain any cells from $B_{ST}$. This implies that every cell in $B_{ST}$ is label homogeneous.

Recall that $\hat{H} = \{h \in H : h \text{ has a } (\rho, 0) \text{ margin w.r.t. } w(S)\}$.

Claim 1. For $h, h' \in \hat{H}$, if $\widehat{\mathrm{err}}_{w(S)}(h) \le \widehat{\mathrm{err}}_{w(S)}(h')$ then $\mathrm{err}_{P^T}(h) \le \mathrm{err}_{P^T}(h') + 2\epsilon$.

Before proving the claim itself, we argue that the claim implies the statement of the theorem. Note that since the $2\rho$-margin around $h^*$ has weight at most $\epsilon/2$, it cannot fully contain any of the cells in $B_{ST}$. Therefore, the $\rho$-margin around $h^*$ does not contain any reweighed sample points from $w(S)$, that is, $h^* \in \hat{H}$. Consider some $h \in \operatorname{argmin}_{h\in\hat{H}}\widehat{\mathrm{err}}_{w(S)}(h)$. By the claim, we get
$$\mathrm{err}_{P^T}(h) \le \mathrm{err}_{P^T}(h^*) + 2\epsilon \le \mathrm{opt}_{P^T}(H) + 3\epsilon.$$

Proof of Claim 1. Let $\widehat{\mathrm{err}}_{w(S)}(h) \le \widehat{\mathrm{err}}_{w(S)}(h')$. Note that since both $h$ and $h'$ have a $(\rho, 0)$-margin with respect to $w(S)$, the symmetric difference $h\Delta h'$ cannot properly intersect any of the cells in $B_{ST}$. Now, if the set $h\Delta h'$ does not fully contain any cell from $B_{ST}$, then $\mathrm{err}_{P^T}(h) \le \mathrm{err}_{P^T}(h') + \epsilon$, due to Equation (1). If the set $h\Delta h'$ does fully contain cells from $B_{ST}$, then all these cells are label homogeneous, since, as shown above, the $\rho$-margin around the labeling function $l$ cannot contain any cells from $B_{ST}$. For this case, note that there are at most $1/\epsilon$ heavy cells (cells that have weight at least $\epsilon$). Thus, since every cell is $\epsilon^2$-approximated by $T$, any union of heavy cells is $\epsilon = \epsilon^2 \cdot \frac{1}{\epsilon}$-approximated by $T$ and thus also by $w(S)$. Thus, again invoking Equation (1), we get that the symmetric difference $h\Delta h'$ is $2\epsilon$-approximated by $w(S)$. This implies
$$\mathrm{err}_{P^T}(h) \le \mathrm{err}_{P^T}(h') + \epsilon + \epsilon \le \mathrm{err}_{P^T}(h') + 2\epsilon.$$

E ε-nets

E.1 ε-nets

The notion of an $\epsilon$-net was introduced by [14] and has many applications in computational geometry and machine learning.

Definition. Let $X$ be some domain, $W \subseteq 2^X$ a collection of subsets of $X$ and $P$ a distribution over $X$. An $\epsilon$-net for $W$ with respect to $P$ is a subset $N \subseteq X$ that intersects every member of $W$ that has $P$-weight at least $\epsilon$.

The following key result concerning $\epsilon$-nets states that they are easy to come by. A version of it for uniform distributions appeared first in [14]. The general version we employ here can be found as Theorem 28.3 in [15].

Lemma 3 ([14, 15]). Let $X$ be some domain, $W \subseteq 2^X$ a collection of subsets of $X$ of some finite VC-dimension $d$. Then, for every probability distribution $P$ over $X$, for every $\epsilon > 0$ and $\delta > 0$, a set of size $O\left(\frac{d\log(d/\epsilon)+\log(1/\delta)}{\epsilon}\right)$ sampled i.i.d. from $P$ is an $\epsilon$-net for $W$ with respect to $P$ with probability at least $1-\delta$.

We now relate $\epsilon$-nets for a source distribution to $\epsilon$-nets for a target distribution:

Lemma 4 (Slight variation of a lemma in [7]). Let $X$ be some domain, $W \subseteq 2^X$ a collection of subsets of $X$, and $P^S$ and $P^T$ a source and a target distribution over $X$ with $C := C_{W,\epsilon}(P^S, P^T) \ge 0$. Then every $(C\epsilon)$-net for $W$ with respect to $P^S$ is an $\epsilon$-net for $W$ with respect to $P^T$.

Proof. Let $N \subseteq X$ be a $(C\epsilon)$-net for $W$ with respect to $P^S$. Consider a $U \in W$ that has target weight at least $\epsilon$, i.e. $P^T(U) \ge \epsilon$. Then we have $P^S(U) \ge C\,P^T(U) \ge C\epsilon$. As $N$ is a $(C\epsilon)$-net for $W$ with respect to $P^S$, we have $N \cap U \neq \emptyset$.

Combining the above lemmas, we get the following result for sample sizes sufficient for obtaining $\epsilon$-nets under a weight ratio assumption:

Corollary 1. Let $X$ be some domain, $W \subseteq 2^X$ a collection of subsets of $X$ of some finite VC-dimension $d$, and let $P^S$ and $P^T$ be source and target distributions over $X$ with $C := C_{W,\epsilon}(P^S, P^T) > 0$. Then, for every $\epsilon > 0$ and $\delta > 0$, a set of size $O\left(\frac{d\log(d/C\epsilon)+\log(1/\delta)}{C\epsilon}\right)$ sampled i.i.d. from $P^S$ is an $\epsilon$-net for $W$ with respect to $P^T$ with probability at least $1-\delta$.

F ε-approximations

The notion of $\epsilon$-approximation plays a key role in deriving finite sample bounds for VC-classes [16].

Definition ($\epsilon$-approximation). Let $X$ be some domain, $B \subseteq 2^X$ a collection of subsets of $X$ and $P$ a distribution over $X$. An $\epsilon$-approximation for $B$ with respect to $P$ is a finite subset $S \subseteq X$ with
$$|\hat{S}(b) - P(b)| \le \epsilon \quad \text{for all sets } b \in B,$$
where $\hat{S}(b)$ denotes the empirical frequency $|S \cap b|/|S|$.

It is shown in [14] (and already in [16]) that, for a collection $B$ of subsets of some domain set $X$ with finite VC-dimension and any distribution $P$ over $X$, an i.i.d. sample of size
$$\frac{16\,\mathrm{VC}(B)}{\epsilon^2}\left(\mathrm{VC}(B)\,\ln\frac{16}{\epsilon^2} + \ln\frac{4}{\delta}\right)$$
is an $\epsilon$-approximation for $B$ with respect to $P$ with probability at least $1-\delta$.
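As a small plug-in example (ours), the bound above, with the constants as displayed here, can be evaluated numerically:

```python
from math import ceil, log

def eps_approximation_sample_size(vc_dim, eps, delta):
    """Sufficient sample size for an eps-approximation, plugging into the
    bound displayed above (constants as stated there)."""
    return ceil((16 * vc_dim / eps**2) * (vc_dim * log(16 / eps**2) + log(4 / delta)))

# Example: for a class of VC-dimension 1 (e.g., the cells of a partition),
# eps_approximation_sample_size(1, 0.05, 0.05) is roughly 8.4e4.
```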

