A norm-concentration argument for non-convex regularisation

Ata Kabán, Robert J. Durrant
School of Computer Science, The University of Birmingham, Birmingham B15 2TT, UK

ICML/UAI/COLT Workshop on Sparse Optimization and Variable Selection Helsinki, 9 July 2008.

Introduction

L1-regularisation - a workhorse in machine learning:
• sparsity
• convexity
• logarithmic sample complexity

Non-convex norm regularisation - seems to have added value:
• statistics (Fan & Li, '01): oracle property
• signal processing (Chartrand, '07), signal reconstruction (Wipf & Rao, '05)
• 0-norm SVM classification (Weston et al., '03) (results data-dependent)
• genomic data classification (Liu et al., '07)

Regularised regression in high dimensions

Training set {(x_j, y_j)}_{j=1}^n, where x_j ∈ R^m are m-dimensional inputs and y_j ∈ {−1, 1} are their labels. Scenario of interest: few (r ≪ m) relevant features, small sample size (n ≪ m). Consider regularised logistic regression for concreteness:

$$\max_{w} \; \sum_{j=1}^{n} \log p(y_j \mid x_j, w) \quad \text{subject to} \quad \|w\|_q \le A \qquad (1)$$

where w ∈ R^{1×m} are the unknown parameters, p(y | wᵀx) = 1/(1 + exp(−y wᵀx)), and ‖w‖_q = (Σ_{i=1}^m |w_i|^q)^{1/q}.

If q = 2: L2-regularised ('ridge') logistic regression.
If q = 1: L1-regularised ('lasso') logistic regression.
If q < 1: Lq<1-regularised logistic regression: non-convex, non-differentiable at 0.
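To make (1) concrete, below is a minimal sketch of the penalty (Lagrangian) form of the problem. This is not the authors' implementation: the helper name lq_logistic, the smoothing of |w_i|^q as (w_i² + eps)^{q/2} (so a generic gradient-based optimiser can be used), and the toy data are all illustrative assumptions. For q < 1 the objective is non-convex, so the optimiser only returns a local optimum that depends on the starting point.

```python
import numpy as np
from scipy.optimize import minimize

def lq_logistic(X, y, q, lam, eps=1e-8, w0=None):
    """Sketch of Lq-penalised logistic regression (penalty form of (1)).
    y must be in {-1, +1}; |w_i|^q is smoothed as (w_i^2 + eps)^(q/2)."""
    n, m = X.shape

    def objective(w):
        margins = y * (X @ w)
        nll = np.sum(np.logaddexp(0.0, -margins))   # -sum_j log p(y_j | x_j, w)
        penalty = np.sum((w**2 + eps) ** (q / 2.0))  # smoothed ||w||_q^q
        return nll + lam * penalty

    w0 = np.zeros(m) if w0 is None else w0
    return minimize(objective, w0, method="L-BFGS-B").x

# Toy usage: m = 100 dimensions, r = 3 relevant features, n = 30 samples.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 100))
w_true = np.zeros(100); w_true[:3] = 2.0
y = np.sign(X @ w_true + 0.1 * rng.normal(size=30))
w_hat = lq_logistic(X, y, q=0.5, lam=1.0)
print("non-negligible weights:", np.sum(np.abs(w_hat) > 1e-2))
```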

A word on some recent estimation algorithms

[Figure: two panels plotting the penalty |w_i|^q and its majorising bounds over w_i ∈ [−4, 4].]

Local quadratic (Fan & Li, '01) vs. local linear (Zou & Li, '08) bound, tangent at ±3. Although the latter appears to be the closer approximation, framing the iterative estimation within the E-M methodology shows that the two are in fact equivalent (Kabán & Durrant, ECML'08).
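For concreteness, the following sketch evaluates both majorising bounds of |w|^q at a few points, using the standard forms of the bounds; the tangent point w0 = 3 matches the figure, and everything else is an illustrative assumption.

```python
import numpy as np

def lqa_bound(w, w0, q):
    """Local quadratic majoriser of |w|^q, tangent at w0 (Fan & Li, '01 style).
    A valid upper bound for q <= 2, since t^q is concave as a function of t^2."""
    return np.abs(w0)**q + 0.5 * q * np.abs(w0)**(q - 2) * (w**2 - w0**2)

def lla_bound(w, w0, q):
    """Local linear majoriser of |w|^q, tangent at w0 (Zou & Li, '08 style).
    A valid upper bound for q <= 1, since t^q is concave in t."""
    return np.abs(w0)**q + q * np.abs(w0)**(q - 1) * (np.abs(w) - np.abs(w0))

q, w0 = 0.5, 3.0
w = np.linspace(-4, 4, 9)
print("|w|^q:", np.abs(w)**q)
print("LQA  :", lqa_bound(w, w0, q))   # looser away from +/- w0
print("LLA  :", lla_bound(w, w0, q))   # tighter, kinked at 0
```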

Sample complexity bound

$$\mathcal{H} = \{h(x, y) = -\log p(y \mid w^T x) : x \in \mathbb{R}^m,\ y \in \{-1, 1\}\} \quad \text{the function class}$$
$$er_P(h) = \mathbb{E}_{(x,y) \sim_{iid} P}[h(x, y)] \quad \text{the true error of } h$$
$$\hat{er}_z(h) = \frac{1}{n} \sum_{i=1}^{n} h(x_i, y_i) \quad \text{the sample error of } h \text{ on training set } z \text{ of size } n$$
$$opt_P(\mathcal{H}) = \inf_{h \in \mathcal{H}} er_P(h) \quad \text{the approximation error of } \mathcal{H}$$
$$L(z) = \arg\min_{h \in \mathcal{H}} \hat{er}_z(h) \quad \text{the function returned by the learning algorithm}$$

Theorem (A. Ng, '04, extended from L1 to Lq<1). For all ε > 0, δ > 0, and m, n ≥ 1, in order to ensure that er_P(L(z)) ≤ opt_P(H) + ε with probability 1 − δ, it is enough to have

$$n = \Omega\left((\log m) \times \mathrm{poly}\big(A,\ r^{1/q},\ 1/\epsilon,\ \log(1/\delta)\big)\right) \qquad (2)$$

- logarithmic in the dimensionality m;
- polynomial in the number of relevant features, but growing with r^{1/q}.
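As an illustrative reading of (2) (our numeric example, not part of the original result): with r = 10 relevant features, the polynomial factor grows as

$$r^{1/q}\Big|_{r=10}: \quad q = 1 \;\Rightarrow\; 10, \qquad q = \tfrac{1}{2} \;\Rightarrow\; 10^2, \qquad q = \tfrac{1}{10} \;\Rightarrow\; 10^{10},$$

so the worst-case sample-size guarantee deteriorates rapidly as q decreases, even though the logarithmic dependence on the dimensionality m is unchanged.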

[Figure: 0-1 error, test logloss, and validation logloss plotted against q (0.1 to 1), with curves for r = 5, 10, 30, 50, 100.]

Experiments on m = 200-dimensional data sets, varying the number of relevant features r ∈ {5, 10, 30, 50, 100}. The medians of 60 independent trials are shown, and the error bars represent one standard error. The 0-1 errors are out of 100.

A norm concentration view

Consider the un-regularised version of the problem. Because n ≪ m, the system is under-determined, and so m − n components of w can be set arbitrarily. We can model the arbitrary components of w as i.i.d. uniform: w_i ∼ Unif[−a, a], ∀i ∈ {n + 1, ..., m}, with some large a. The regularisation term is meant to constrain the problem to make it well-posed. However, in very high dimensions a counter-intuitive phenomenon known as the concentration of distances and norms comes into play: the regularisation term becomes essentially the same for all of the infinitely many possible maximisers of the likelihood term.

Distance concentration

Distance concentration is the counter-intuitive phenomenon that, as the data dimensionality increases without bound, all pairwise distances between points become identical. This phenomenon affects every area where high-dimensional data processing is required, e.g. database indexing & retrieval, data analysis, and statistical machine learning.

Concentration of the L2-norm (Demartines, '94). Let x ∈ R^m be a random vector with i.i.d. components of any distribution. Then

$$\lim_{m \to \infty} \frac{\mathbb{E}[\|x\|_2]}{m^{1/2}} = \text{const.}; \qquad \lim_{m \to \infty} \mathrm{Var}[\|x\|_2] = \text{const.} \qquad (3)$$

Concentration of arbitrary dissimilarity functions in arbitrary multivariate distributions (Beyer et al., '99). Let F_m, m = 1, 2, ..., be an infinite sequence of data distributions, and x_1^{(m)}, ..., x_n^{(m)} a random sample of n independent data vectors distributed as F_m. For each m, let ‖·‖ : dom(F_m) → R_+ be a function that takes a point from the domain of F_m and returns a positive real value, and let p > 0 be an arbitrary positive constant. Assume that E[‖x^{(m)}‖^p] and Var[‖x^{(m)}‖^p] are finite and E[‖x^{(m)}‖^p] ≠ 0. If

$$\lim_{m \to \infty} \frac{\mathrm{Var}[(\|x^{(m)}\|)^p]}{\mathbb{E}[(\|x^{(m)}\|)^p]^2} = 0,$$

then

$$\forall \epsilon > 0, \quad \lim_{m \to \infty} P\left[\max_{1 \le j \le n} \|x_j^{(m)}\| \le (1 + \epsilon) \min_{1 \le j \le n} \|x_j^{(m)}\|\right] = 1.$$

[Figure: left, sample estimate of Var[‖x‖²]/E[‖x‖²]² against dimension m (0 to 700); right, log(DMAX_m / DMIN_m) against dimension m (0 to 700).]
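Curves of this kind are easy to reproduce with a short Monte Carlo simulation. The sketch below assumes i.i.d. Unif[0,1] components (any i.i.d. distribution works, per the Demartines result) and the sample sizes n = 100 and trials = 200 are our assumptions; both quantities shrink as m grows.

```python
import numpy as np

rng = np.random.default_rng(1)

def concentration_stats(m, n=100, trials=200):
    """Monte Carlo sketch of L2-norm concentration for x ~ Unif[0,1]^m:
    returns the relative variance Var[||x||^2]/E[||x||^2]^2 and the mean
    max/min ratio of the norms within samples of n points (DMAX_m/DMIN_m)."""
    sq_norms, ratios = [], []
    for _ in range(trials):
        X = rng.uniform(0.0, 1.0, size=(n, m))
        norms = np.linalg.norm(X, axis=1)
        ratios.append(norms.max() / norms.min())
        sq_norms.append(norms ** 2)
    sq = np.concatenate(sq_norms)
    return sq.var() / sq.mean() ** 2, np.mean(ratios)

for m in (10, 100, 700):
    rv, ratio = concentration_stats(m)
    print(f"m={m:3d}  Var/E^2={rv:.4f}  mean DMAX/DMIN={ratio:.3f}")
```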

Applying this to our problem. Denote

$$RV_m^{(p)}(\|x\|_q) = \frac{\mathrm{Var}[(\|x\|_q)^p]}{\mathbb{E}[(\|x\|_q)^p]^2}.$$

Using the independence of w_{n+1}, ..., w_m, we get

$$RV_m^{(q)}(\|w\|_q) = \frac{\sum_{i=1}^{n} \sum_{j=1}^{n} \mathrm{Cov}[|w_i|^q, |w_j|^q] + \sum_{i=n+1}^{m} \mathrm{Var}[|w_i|^q]}{\sum_{i=1}^{m} \sum_{j=1}^{m} \mathbb{E}[|w_i|^q]\, \mathbb{E}[|w_j|^q]},$$

which converges to 0 as m → ∞, since the numerator grows at most linearly in m while the denominator grows quadratically.
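A quick numerical illustration of this convergence (a sketch: we fix the first n components at arbitrary values, standing in for the fitted part of w, and draw the rest i.i.d. Unif[−a, a]; the constants n = 10, a = 10 and the sample sizes are our assumptions):

```python
import numpy as np

rng = np.random.default_rng(3)

def rv_q(m, q=0.5, n=10, a=10.0, trials=2000):
    """Monte Carlo estimate of RV_m^(q)(||w||_q) = Var[||w||_q^q]/E[||w||_q^q]^2
    for w with n fixed components and m - n i.i.d. Unif[-a, a] components."""
    w_fixed = np.linspace(1.0, 2.0, n)           # arbitrary fixed 'fitted' part
    W = rng.uniform(-a, a, size=(trials, m - n))
    s = np.sum(np.abs(w_fixed)**q) + np.sum(np.abs(W)**q, axis=1)
    return s.var() / s.mean()**2

for m in (50, 500, 2000):
    print(f"m={m:4d}  RV={rv_q(m):.6f}")   # decays roughly like 1/m
```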

Hence, the problem remains ill-posed despite the regularisation.

The effect of q

Fortunately, not all norms concentrate at the same rate.

[Figure: x ∼ Unif[0,1]. Sample estimates of the relative variance Var[‖x‖]/E[‖x‖]² against dimension m (0 to 50), for L2-norms, L1-norms, L0.5-'norms' and L0.1-'norms'.]

Theorem (François et al., '07, extended). If w ∈ R^m is a random vector with no more than n < m non-i.i.d. components, where n is finite, and all the other components are i.i.d., then

$$\lim_{m \to \infty} m \, \frac{\mathrm{Var}[\|w\|_q]}{\mathbb{E}[\|w\|_q]^2} = \frac{1}{q^2} \frac{\sigma^2}{\mu^2} \qquad (4)$$

where µ = E[|w_{n+1}|^q], σ² = Var[|w_{n+1}|^q], and n + 1 is one of the i.i.d. dimensions of w.

Applying this to w, we can use (4) to approximate

$$\frac{\mathrm{Var}[\|w\|_q]}{\mathbb{E}[\|w\|_q]^2} \approx \frac{1}{m} \frac{1}{q^2} \frac{\sigma^2}{\mu^2} \qquad (5)$$

for some large m.

Computing µ and σ² for w_{n+1} ∼ Unif[−a, a]:

$$\mu = \mathbb{E}[|w_{n+1}|^q] = \frac{1}{2a} \int_{-a}^{a} |w_{n+1}|^q \, dw_{n+1} = \frac{a^q}{q+1}$$

$$\sigma^2 = \mathbb{E}[|w_{n+1}|^{2q}] - \mathbb{E}[|w_{n+1}|^q]^2 = \frac{a^{2q} q^2}{(2q+1)(q+1)^2}$$

So,

$$\frac{\mathrm{Var}[\|w\|_q]}{\mathbb{E}[\|w\|_q]^2} \approx \frac{1}{m} \frac{1}{2q+1} \qquad (6)$$

(Conveniently, a cancels out in this computation.)

Observe that this is a decreasing function of q. Thus, the smaller the q the better, from the point of view of countering the concentration of the norm in regularisation.
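A Monte Carlo check of (6) (a sketch; the values a = 10, m = 500 and the number of trials are assumptions, and by (6) the estimate should be independent of a):

```python
import numpy as np

rng = np.random.default_rng(2)

def relative_variance(q, m, a=10.0, trials=5000):
    """Monte Carlo estimate of Var[||w||_q] / E[||w||_q]^2 for w with
    i.i.d. Unif[-a, a] components."""
    W = rng.uniform(-a, a, size=(trials, m))
    norms = np.sum(np.abs(W)**q, axis=1) ** (1.0 / q)
    return norms.var() / norms.mean()**2

m = 500
for q in (1.0, 0.5, 0.1):
    print(f"q={q}: simulated={relative_variance(q, m):.5f}  "
          f"predicted 1/(m(2q+1))={1.0/(m*(2*q+1)):.5f}")
```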

[Figure: test 0-1 errors (top) and test logloss (bottom) against data dimension (200 to 1000), in panels for 1 and 3 relevant features, with curves for q ∈ {0.1, 0.3, 0.5, 0.7, 1}.]

Comparative results on 1000-dimensional synthetic data from (Ng, '04). Each point is the median of > 100 independent trials. The 0-1 errors are out of 100.

[Figure: bar charts of 0-1 test errors (top) and test logloss (bottom) for q ∈ {0.1, 0.5, 1}, at train + validation set sizes 52+23 (left) and 35+15 (right).]

Results on 5000-dimensional synthetic data with only one relevant feature and an even smaller sample size. The improvement over L1 becomes larger. (The 0-1 errors are out of 100.)

[Figure: test 0-1 errors, test logloss, and number of features retained, each plotted against q, in panels for 1 relevant feature, 3 relevant features, and exponentially decaying feature relevance.]

Results on synthetic data from (A. Ng, '04). Training set size n = 70, validation set size 30, and out-of-sample test set size 100. The statistics are over 10 independent runs, with dimensionality ranging from 100 to 1000.

Discussion & further work

The learning-theoretic sample complexity bound for generalisation is only a (loose) upper bound. Our analysis based on norm concentration has so far only used that n ≪ m. Further work should examine the effect of r ≪ m from this perspective. The phenomenon of concentration of norms and distances in very high dimensions impacts all high-dimensional problems. Its implications for learning and generalisation (and for other areas) are an open question.

References

C.C. Aggarwal, A. Hinneburg & D.A. Keim. On the surprising behavior of distance metrics in high dimensional space. Proc. Int. Conf. Database Theory, pp. 420-434, 2001.

K. Beyer, J. Goldstein, R. Ramakrishnan & U. Shaft. When is nearest neighbor meaningful? Proc. Int. Conf. Database Theory, pp. 217-235, 1999.

J. Fan & R. Li. Variable selection via nonconcave penalized likelihood and its oracle properties. Journal of the American Statistical Association, vol. 96, no. 456, pp. 1348-1360, 2001.

D. François, V. Wertz & M. Verleysen. The concentration of fractional distances. IEEE Trans. on Knowledge and Data Engineering, vol. 19, no. 7, July 2007.

A. Kabán & R.J. Durrant. Learning with Lq<1 vs. L1-norm regularisation with exponentially many irrelevant features. Proc. ECML 2008, to appear.

Z. Liu, F. Jiang, G. Tian, S. Wang, F. Sato, S.J. Meltzer & M. Tan. Sparse logistic regression with Lp penalty for biomarker identification. Statistical Applications in Genetics and Molecular Biology, vol. 6, issue 1, 2007.

A.Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. Proc. ICML 2004.

H. Zou & R. Li. One-step sparse estimates in nonconcave penalized likelihood models. The Annals of Statistics, 2008.
