Learning features to compare distributions

Arthur Gretton
Gatsby Computational Neuroscience Unit, University College London

NIPS 2016 Workshop on Adversarial Learning, Barcelona, Spain

Goal of this talk

Have: two collections of samples, X and Y, drawn from unknown distributions P and Q.
Goal: learn distinguishing features that indicate how P and Q differ.

Divergences

[Figure-only slides: illustrations of divergences between P and Q]

Sriperumbudur, Fukumizu, G., Schoelkopf, Lanckriet (2012)

Overview

The maximum mean discrepancy:
- How to compute and interpret the MMD
- How to train the MMD
- Application to troubleshooting GANs

The ME test statistic:
- Informative, linear-time features for comparing distributions
- How to learn these features

TL;DR: Variance matters.

The maximum mean discrepancy

Are P and Q different?

[Figure: densities P(x) and Q(y) overlaid on the same axis]

Maximum mean discrepancy (on sample)

Observe X = {x_1, ..., x_n} from P and Y = {y_1, ..., y_n} from Q.

Place a Gaussian kernel on each x_i and on each y_i.

Mean embedding of P (empirical): \hat{\mu}_P(v) = \frac{1}{n} \sum_{i=1}^{n} k(x_i, v)

Mean embedding of Q (empirical): \hat{\mu}_Q(v), defined analogously from the y_i.

Witness function: \mathrm{witness}(v) = \hat{\mu}_P(v) - \hat{\mu}_Q(v)

Maximum mean discrepancy (on sample)

\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)
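As an aside (not part of the original slides), here is a minimal NumPy sketch of the unbiased estimator above, assuming a Gaussian kernel; the function names and the bandwidth choice are illustrative.

```python
import numpy as np

def gaussian_kernel(A, B, sigma=1.0):
    """Gaussian kernel matrix k(a_i, b_j) = exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    sq = np.sum(A**2, 1)[:, None] + np.sum(B**2, 1)[None, :] - 2 * A @ B.T
    return np.exp(-sq / (2 * sigma**2))

def mmd2_unbiased(X, Y, sigma=1.0):
    """Unbiased estimate of MMD^2 between samples X ~ P and Y ~ Q."""
    n, m = len(X), len(Y)
    Kxx = gaussian_kernel(X, X, sigma)
    Kyy = gaussian_kernel(Y, Y, sigma)
    Kxy = gaussian_kernel(X, Y, sigma)
    # Drop diagonal terms so the within-sample averages are unbiased.
    term_xx = (Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
    term_yy = (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
    term_xy = 2 * Kxy.mean()
    return term_xx + term_yy - term_xy

# Example: two Gaussians with different means.
rng = np.random.default_rng(0)
X = rng.normal(0.0, 1.0, size=(500, 1))
Y = rng.normal(0.5, 1.0, size=(500, 1))
print(mmd2_unbiased(X, Y, sigma=1.0))
```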

Overview

Dogs (P) and fish (Q) example revisited. Each entry of the kernel matrix is one of k(\mathrm{dog}_i, \mathrm{dog}_j), k(\mathrm{dog}_i, \mathrm{fish}_j), or k(\mathrm{fish}_i, \mathrm{fish}_j).

Overview

The maximum mean discrepancy:

\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{dog}_i, \mathrm{dog}_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(\mathrm{fish}_i, \mathrm{fish}_j) - \frac{2}{n^2} \sum_{i,j} k(\mathrm{dog}_i, \mathrm{fish}_j)

Asymptotics of MMD

The MMD:

\widehat{\mathrm{MMD}}^2 = \frac{1}{n(n-1)} \sum_{i \neq j} k(x_i, x_j) + \frac{1}{n(n-1)} \sum_{i \neq j} k(y_i, y_j) - \frac{2}{n^2} \sum_{i,j} k(x_i, y_j)

... but how to choose the kernel?

Perspective from statistical hypothesis testing:
- When P = Q, \widehat{\mathrm{MMD}}^2 should be "close to zero".
- When P \neq Q, \widehat{\mathrm{MMD}}^2 should be "far from zero".
- A threshold c_\alpha on \widehat{\mathrm{MMD}}^2 gives false positive rate \alpha.

A statistical test

[Figure: densities of n \times \widehat{\mathrm{MMD}}^2 under P = Q and P \neq Q; c_\alpha is the (1-\alpha) quantile when P = Q, and the mass of the P \neq Q density below c_\alpha gives the false negative rate]

The best kernel gives the lowest false negative rate (= highest power)
... but can you train for this?

Asymptotics of MMD

When P \neq Q, the statistic is asymptotically normal:

\frac{\widehat{\mathrm{MMD}}^2 - \mathrm{MMD}^2(P,Q)}{\sqrt{V_n(P,Q)}} \xrightarrow{D} \mathcal{N}(0, 1),

where \mathrm{MMD}(P,Q) is the population MMD and V_n(P,Q) = O(n^{-1}).

[Figure: empirical MMD distribution under H_1 with a Gaussian fit]

Asymptotics of MMD

When P = Q, the statistic has the asymptotic distribution

n \widehat{\mathrm{MMD}}^2 \xrightarrow{D} \sum_{l=1}^{\infty} \lambda_l \left( z_l^2 - 2 \right),

where z_l \sim \mathcal{N}(0, 2) i.i.d. and the \lambda_l solve

\lambda_l \psi_l(x') = \int \tilde{k}(x, x') \, \psi_l(x) \, dP(x),

with \tilde{k} the centred kernel.

[Figure: empirical PDF of n \times \widehat{\mathrm{MMD}}^2 under H_0, with the \chi^2-sum approximation]
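The eigenvalues \lambda_l are generally unknown, so in practice the null quantile c_\alpha is often estimated by resampling. The sketch below is an implementation assumption of my own (not spelled out in the slides): it estimates the threshold by permuting the pooled sample, reusing mmd2_unbiased from the earlier sketch.

```python
import numpy as np

def mmd_permutation_threshold(X, Y, sigma=1.0, alpha=0.05, n_perm=200, seed=0):
    """Estimate the (1 - alpha) null quantile of n * MMD^2 by permuting pooled samples."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([X, Y])
    n = len(X)
    null_stats = []
    for _ in range(n_perm):
        perm = rng.permutation(len(pooled))
        Xp, Yp = pooled[perm[:n]], pooled[perm[n:]]
        # mmd2_unbiased is defined in the earlier sketch.
        null_stats.append(n * mmd2_unbiased(Xp, Yp, sigma))
    return np.quantile(null_stats, 1 - alpha)

# Reject H0: P = Q if n * mmd2_unbiased(X, Y, sigma) exceeds this threshold.
```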

Optimizing test power

The power of our test (Pr_1 denotes probability under P \neq Q):

\mathrm{Pr}_1\!\left( n \widehat{\mathrm{MMD}}^2 > \hat{c}_\alpha \right) \;\to\; 1 - \Phi\!\left( \frac{\hat{c}_\alpha}{n \sqrt{V_n(P,Q)}} - \frac{\mathrm{MMD}^2(P,Q)}{\sqrt{V_n(P,Q)}} \right),

where \Phi is the CDF of the standard normal distribution and \hat{c}_\alpha is an estimate of the test threshold c_\alpha.

The first term is asymptotically negligible: \frac{\hat{c}_\alpha}{n \sqrt{V_n(P,Q)}} = O(n^{-3/2}), while \frac{\mathrm{MMD}^2(P,Q)}{\sqrt{V_n(P,Q)}} = O(n^{1/2}).

To maximize test power, maximize \frac{\mathrm{MMD}^2(P,Q)}{\sqrt{V_n(P,Q)}}.
(Sutherland, Tung, Strathmann, De, Ramdas, Smola, G., in review for ICLR 2017)

Code: github.com/dougalsutherland/opt-mmd
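A rough sketch of kernel selection by the power criterion above. The referenced paper uses a closed-form variance estimate; here, as a simplification of my own, the variance is proxied by the spread of \widehat{\mathrm{MMD}}^2 over random half-sample subsamples, and the candidate bandwidths are arbitrary. It reuses mmd2_unbiased and the example data X, Y from the first sketch.

```python
import numpy as np

def power_criterion(X, Y, sigma, n_splits=10, eps=1e-8, seed=0):
    """Proxy for MMD^2 / sqrt(V_n): mean of MMD^2 over random half-sample
    subsamples, divided by its standard deviation (plus eps for stability)."""
    rng = np.random.default_rng(seed)
    stats = []
    for _ in range(n_splits):
        ix, iy = rng.permutation(len(X)), rng.permutation(len(Y))
        half_x, half_y = len(X) // 2, len(Y) // 2
        stats.append(mmd2_unbiased(X[ix[:half_x]], Y[iy[:half_y]], sigma))
    stats = np.array(stats)
    return stats.mean() / (stats.std() + eps)

# Pick the bandwidth with the largest criterion on held-out training data,
# then run the test with that kernel on separate test data.
best_sigma = max([0.25, 0.5, 1.0, 2.0, 4.0], key=lambda s: power_criterion(X, Y, s))
```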

Troubleshooting for generative adversarial networks

[Figure: MNIST samples vs. samples from a GAN, with the learned ARD map]

Power for optimized ARD kernel: 1.00 at \alpha = 0.01
Power for optimized RBF kernel: 0.57 at \alpha = 0.01

Benchmarking generative adversarial networks


The ME statistic and test


Distinguishing Feature(s)

[Figure: mean embedding of P, mean embedding of Q, and the witness function \mathrm{witness}(v) = \hat{\mu}_P(v) - \hat{\mu}_Q(v)]

Distinguishing Feature(s)

Take the square of the witness, \mathrm{witness}^2(v) (we only care about amplitude).

New test statistic: \mathrm{witness}^2 at a single location v^*; linear time in the number n of samples.
... but how to choose the best feature v^*?

Best feature = the v^* that maximizes \mathrm{witness}^2(v)?

Distinguishing Feature(s)

[Figure: empirical \mathrm{witness}^2(v) for sample sizes n = 3, 50, 500, together with the population \mathrm{witness}^2 function, P(x), and Q(y); two candidate locations v are marked]

Variance of witness function

Variance at v = variance of X at v + variance of Y at v.

ME statistic: \hat{\lambda}_n(v) = n \, \frac{\mathrm{witness}^2(v)}{\text{variance at } v}.

[Figure: P(x), Q(y), \mathrm{witness}^2(v), the variances of X and Y at v, and the resulting \hat{\lambda}_n(v) with its maximizer v^*]

The best location is the v^* that maximizes \hat{\lambda}_n(v). Performance improves with multiple locations \{v_j\}_{j=1}^J.
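A minimal sketch (not from the slides) of the single-location statistic \hat{\lambda}_n(v) above, with a Gaussian kernel; the small regularizer and the function names are illustrative assumptions.

```python
import numpy as np

def me_statistic(X, Y, v, sigma=1.0, reg=1e-8):
    """lambda_hat_n(v) = n * witness(v)^2 / (var_X(v) + var_Y(v)); linear time in n."""
    def feat(Z):
        # k(z_i, v) for each sample z_i, Gaussian kernel with bandwidth sigma.
        return np.exp(-np.sum((Z - v) ** 2, axis=1) / (2 * sigma ** 2))
    fx, fy = feat(X), feat(Y)
    n = min(len(X), len(Y))
    witness = fx.mean() - fy.mean()
    variance = fx.var() + fy.var() + reg
    return n * witness ** 2 / variance

# Choose v (and sigma) by maximizing me_statistic on a held-out training split,
# then compute the test statistic on the remaining data.
```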

Distinguishing Positive/Negative Emotions

35 females and 35 males (Lundqvist et al., 1998); happy, neutral, surprised vs. afraid, angry, disgusted. Pixel features, 48 x 34 = 1632 dimensions. Sample size: 402.

[Figure: test power (+ vs. -) for a random feature, the proposed test, and MMD (quadratic time), together with the learned feature]

The proposed test achieves maximum test power in time O(n). Informative features: differences at the nose and smile lines.

Code: https://github.com/wittawatj/interpretable-test

Final thoughts

Witness function approaches:
- Diversity of samples: the MMD test uses pairwise similarities between all samples; the ME test uses similarities to J reference features.
- Disjoint support of generator/data distributions: the witness function is smooth.

Other discriminator heuristics:
- Diversity of samples via the minibatch heuristic (add distances to neighbouring samples as features), Salimans et al. (2016).
- Disjoint support treated by adding noise to "blur" images, Arjovsky and Bottou (2016), Huszar (2016).

Co-authors

Students and postdocs: Kacper Chwialkowski (at Voleon), Wittawat Jitkrittum, Heiko Strathmann, Dougal Sutherland

Collaborators: Kenji Fukumizu, Krikamol Muandet, Bernhard Schoelkopf, Bharath Sriperumbudur, Zoltan Szabo

Questions?

Testing against a probabilistic model


Statistical model criticism

\mathrm{MMD}(P, Q) = \sup_{\|f\| \le 1} \left[ \mathbf{E}_Q f - \mathbf{E}_P f \right]

[Figure: densities p(x) and q(x) with the witness function f^*(x)]

f^*(x) is the witness function.

Can we compute MMD with samples from Q and a model P? Problem: we usually can't compute \mathbf{E}_P f in closed form.

Stein idea

To get rid of \mathbf{E}_P f in \sup_{\|f\| \le 1} \left[ \mathbf{E}_Q f - \mathbf{E}_P f \right], we define the Stein operator

T_p f = \partial_x f + f \, \partial_x \log p.

Then \mathbf{E}_P \left[ T_p f \right] = 0, subject to appropriate boundary conditions.

(Oates, Girolami, Chopin, 2016)
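As a quick numerical illustration (my own addition, not from the slides), the sketch below checks \mathbf{E}_P[T_p f] \approx 0 by Monte Carlo for a standard normal p and an arbitrary smooth test function f.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=200_000)          # samples from p = N(0, 1)

f = lambda t: np.tanh(t)              # a bounded, smooth test function
df = lambda t: 1.0 - np.tanh(t) ** 2  # its derivative
dlogp = lambda t: -t                  # d/dx log p(x) for the standard normal

# Stein operator: (T_p f)(x) = f'(x) + f(x) * d/dx log p(x)
stein_values = df(x) + f(x) * dlogp(x)
print(stein_values.mean())            # close to 0, since E_P[T_p f] = 0
```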

Maximum Stein Discrepancy

Stein operator:

T_p f = \partial_x f + f \, \partial_x \log p

Maximum Stein Discrepancy (MSD):

\mathrm{MSD}(p, q) = \sup_{\|g\| \le 1} \left[ \mathbf{E}_q \, T_p g - \mathbf{E}_p \, T_p g \right] = \sup_{\|g\| \le 1} \mathbf{E}_q \, T_p g,

since \mathbf{E}_p \, T_p g = 0.

[Figure: densities p(x) and q(x) with the optimal Stein witness g^*(x)]

Maximum Stein Discrepancy

Closed-form expression for MSD: given Z, Z' \sim q, then

\mathrm{MSD}(p, q) = \mathbf{E}_q \, h_p(Z, Z'),

where

h_p(x, y) = \partial_x \log p(x) \, \partial_y \log p(y) \, k(x, y) + \partial_y \log p(y) \, \partial_x k(x, y) + \partial_x \log p(x) \, \partial_y k(x, y) + \partial_x \partial_y k(x, y)

and k is the RKHS kernel.
(Chwialkowski, Strathmann, G., 2016) (Liu, Lee, Jordan, 2016)

Only depends on the kernel and \partial_x \log p(x). No need to normalize p, or to sample from it.
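A one-dimensional sketch of the closed-form estimator above, assuming a Gaussian kernel whose derivatives are written out explicitly; the V-statistic averaging and the example model are illustrative choices of mine, not taken from the talk.

```python
import numpy as np

def msd_estimate(z, dlogp, sigma=1.0):
    """V-statistic estimate of MSD(p, q) from samples z ~ q (1-d case).
    dlogp(x) must return d/dx log p(x) for the model p."""
    z = np.asarray(z, dtype=float)
    d = z[:, None] - z[None, :]                        # pairwise differences x - y
    k = np.exp(-d**2 / (2 * sigma**2))                 # Gaussian kernel k(x, y)
    dkdx = -d / sigma**2 * k                           # d/dx k(x, y)
    dkdy = d / sigma**2 * k                            # d/dy k(x, y)
    dkdxdy = (1.0 / sigma**2 - d**2 / sigma**4) * k    # d^2/dxdy k(x, y)
    s = dlogp(z)                                       # model score at each sample
    h = (s[:, None] * s[None, :] * k
         + s[None, :] * dkdx
         + s[:, None] * dkdy
         + dkdxdy)
    return h.mean()

# Example: model p = N(0, 1); samples actually drawn from N(0.5, 1).
rng = np.random.default_rng(0)
z = rng.normal(0.5, 1.0, size=2000)
print(msd_estimate(z, dlogp=lambda x: -x, sigma=1.0))
```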

Statistical model criticism

[Figure: normalised solar activity over the years 1600-2000]

Test the hypothesis that a Gaussian process model, learned from data, is a good fit for the test data (example from Lloyd and Ghahramani, 2015).

Code: https://github.com/karlnapf/kernel_goodness_of_fit

Statistical model criticism

[Figure: histogram of the V_n test statistic against the bootstrapped null distribution B_n]

Test the hypothesis that a Gaussian process model, learned from data, is a good fit for the test data.
