SMOOTHNESS MAXIMIZATION VIA GRADIENT DESCENTS

Bin Zhao, Fei Wang, Changshui Zhang

State Key Laboratory of Intelligent Technologies and Systems, Department of Automation, Tsinghua University, Beijing 100084, P.R. China

ABSTRACT

Recent years have witnessed a surge of interest in graph based semi-supervised learning. However, despite extensive research, there has been little work on graph construction. In this study, employing the idea of gradient descent, we propose a novel method called Iterative Smoothness Maximization (ISM) to learn an optimal graph automatically for a semi-supervised learning task. The main procedure of ISM is to minimize the upper bound of the semi-supervised classification error through an iterative gradient descent approach. We also prove the convergence of ISM theoretically, and experimental results on two real-world data sets are provided to demonstrate the effectiveness of ISM.

Index Terms— Semi-Supervised Learning (SSL), Cluster Assumption, Gaussian Function, Gradient Descent

1. INTRODUCTION

In many practical applications of pattern classification and data mining, one often faces a lack of sufficient labeled data, since labeling requires expensive human labor and much time. In many cases, however, large amounts of unlabeled data are far easier to obtain. Consequently, Semi-Supervised Learning (SSL) methods, which aim to learn from partially labeled data, have been proposed [1]. In general, these methods can be categorized into two classes: transductive learning (e.g. [2]) and inductive learning (e.g. [3]). The goal of transductive learning is to estimate the labels of the given unlabeled data, whereas inductive learning tries to induce a decision function that has a low error rate on the whole sample space.

In recent years, graph based semi-supervised learning has become one of the most active research areas in the SSL community [4]. The key to graph based SSL is the cluster assumption [5]. It states that (1) nearby points are likely to have the same label; (2) points on the same structure (such as a cluster or a submanifold) are likely to have the same label. Based on the above assumptions, graph based SSL uses a graph G = <V, E> to describe the structure of a data set, where V is the node set corresponding to the labeled and unlabeled examples in the data set, and E is the edge set.

In most of the traditional methods [2][5][6], each edge e_ij ∈ E is associated with a weight, usually computed by a Gaussian function, which reflects the similarity between pairwise data points, i.e.

$$w_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \qquad (1)$$
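For concreteness, here is a minimal NumPy sketch of how the weight matrix in Eq. (1) can be built. The function name gaussian_affinity and the assumption that the data points are stacked row-wise in a matrix X of shape (n, d) are our own illustrative choices; the zero-diagonal convention (W_ii = 0) anticipates the definition given in Section 2.

```python
import numpy as np

def gaussian_affinity(X, sigma):
    """Weight matrix of Eq. (1): W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)), with W_ii = 0."""
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T   # pairwise squared distances
    d2 = np.maximum(d2, 0.0)                                     # guard against round-off
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)                                     # no self-loops
    return W
```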

However, as pointed out by [1], although graph construction is at the heart of graph based SSL, it is still a problem that has not been well studied. More concretely, the variance σ in Eq. (1) can affect the final classification result significantly, as can be seen from the toy example shown in Fig. 1; yet, according to [2], there has so far been no reliable approach that can determine an optimal σ automatically. To address this problem, we propose a gradient based method, Iterative Smoothness Maximization (ISM), which aims at learning both the data labels and the optimal hyperparameter of the Gaussian function used to construct the graph. The ISM algorithm first establishes a cost function composed of two parts, i.e. the smoothness and the fitness of the data labels, to measure how good the classification result of the semi-supervised learning task is. ISM then minimizes this cost function by alternating between a smoothness maximization step and a graph reconstruction step. We also prove the convergence of the ISM algorithm theoretically.

The rest of this paper is organized as follows. In section 2, we introduce some works related to this paper. The ISM algorithm is presented in detail in section 3. In section 4, we analyze the convergence property of the algorithm. Experimental results on two real-world data sets are provided in section 5, followed by the conclusions and future works in section 6.

2. NOTATIONS AND RELATED WORKS

In this section we introduce some notations and briefly review some works related to this paper. We are given a point set X = {x_1, ..., x_l, x_{l+1}, ..., x_n} and a label set L = {1, ..., c}, where the first l points in X are labeled as t_i ∈ L and the remaining points are unlabeled. Our goal is to predict the labels of the unlabeled points.¹

¹ In this paper we concentrate on the transductive setting. One can easily extend our algorithm to the inductive setting using the method introduced in [7].

Fig. 1. Classification results on the two-moon pattern using the method in [2], a powerful transductive approach operating on a graph whose edge weights are computed by a Gaussian function. (a) toy data set with two labeled points; (b) classification results with σ = 0.15; (c) classification results with σ = 0.4. We can see that a small variation of σ causes a dramatically different classification result.

We denote the initial labels in the data set by an n × c matrix T. For each labeled point x_i, T_ij = 1 if x_i is labeled as t_i = j and T_ij = 0 otherwise. For unlabeled points, the corresponding rows in T are zero. The classification result on the data set X is also represented as an n × c matrix F = [F_1^T, ..., F_n^T]^T, which determines the label of x_i by t_i = arg max_{1≤j≤c} F_ij. In graph based semi-supervised learning, we construct the n × n weight matrix W for the graph G with its (i, j)-th entry W_ij = w_ij computed by Eq. (1), and W_ii = 0. The degree matrix D for the graph G is defined as an n × n diagonal matrix with its (i, i)-th entry equal to the sum of the i-th row of W. Based on the above preliminaries, Zhou et al. proposed the Learning with Local and Global Consistency (LLGC) algorithm to tackle the SSL problem by minimizing the following cost function [2]:

$$Q = \frac{1}{2}\left[\sum_{i,j=1}^{n} W_{ij}\left\|\frac{F_i}{\sqrt{D_{ii}}}-\frac{F_j}{\sqrt{D_{jj}}}\right\|^2 + \mu\sum_{i=1}^{n}\|F_i-T_i\|^2\right] \qquad (2)$$

The first term measures the smoothness of the data labels, and the second term requires that a good classifying function should not change too much from the initial label assignment. The regularization parameter µ > 0 adjusts the trade-off between these two terms. Thus, the optimal classification function can be obtained by F* = arg min_F Q.

As we noted in section 1, one of the problems of these graph based methods is that the hyperparameter (i.e. σ in Eq. (1)) can affect the final classification results significantly. Therefore, many methods have been proposed to determine the optimal hyperparameter automatically, such as the Local Scaling [8] and Minimum Spanning Tree [6] methods. Although they can work well empirically, they are heuristic and thus lack a theoretical foundation.

3. ITERATIVE SMOOTHNESS MAXIMIZATION

In this section we introduce the main procedure of the Iterative Smoothness Maximization (ISM) algorithm.

3.1. Gradient Computing

We propose to learn the optimal σ through a gradient descent procedure. More concretely, employing the cost function proposed in Eq. (2), we can compute the gradient of Q(F, σ) with respect to σ, with F fixed at F* = arg min_F Q:

$$\begin{aligned}
\frac{\partial Q(F,\sigma)}{\partial\sigma}
&= \frac{\partial}{\partial\sigma}\left[\mu\sum_{j=1}^{n}\|F_j-T_j\|^2 + \sum_{i,j=1}^{n} W_{ij}\left(\frac{F_i}{\sqrt{D_{ii}}}-\frac{F_j}{\sqrt{D_{jj}}}\right)^2\right] \\
&= \frac{\partial}{\partial\sigma}\sum_{i,j=1}^{n} W_{ij}\left(\frac{F_i}{\sqrt{D_{ii}}}-\frac{F_j}{\sqrt{D_{jj}}}\right)^2 \\
&= \sum_{i,j=1}^{n}\left\{\frac{\partial W_{ij}}{\partial\sigma}\left[\frac{F_i}{\sqrt{D_{ii}}}-\frac{F_j}{\sqrt{D_{jj}}}\right]^2
- W_{ij}\left[\frac{F_i}{\sqrt{D_{ii}}}-\frac{F_j}{\sqrt{D_{jj}}}\right]\cdot\left(\frac{F_i}{\sqrt{D_{ii}^3}}\frac{\partial D_{ii}}{\partial\sigma}-\frac{F_j}{\sqrt{D_{jj}^3}}\frac{\partial D_{jj}}{\partial\sigma}\right)\right\}
\end{aligned} \qquad (3)$$

Using the Gaussian function to measure similarity, and writing d_ij = ‖x_i − x_j‖, we get

$$\frac{\partial W_{ij}}{\partial\sigma} = \frac{\partial \exp\left(-\frac{d_{ij}^2}{2\sigma^2}\right)}{\partial\sigma} = \frac{d_{ij}^2\exp\left(-\frac{d_{ij}^2}{2\sigma^2}\right)}{\sigma^3} \qquad (4)$$

$$\frac{\partial D_{ii}}{\partial\sigma} = \frac{\partial \sum_j W_{ij}}{\partial\sigma} = \sum_j \frac{\partial W_{ij}}{\partial\sigma} = \sum_j \frac{d_{ij}^2\exp\left(-\frac{d_{ij}^2}{2\sigma^2}\right)}{\sigma^3} \qquad (5)$$
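To illustrate how Eqs. (3)-(5) fit together, the following NumPy sketch evaluates ∂Q/∂σ for a fixed label matrix F. The helper name dQ_dsigma, the data matrix X, and the decision to materialize all pairwise differences (O(n²c) memory) are our own assumptions, not part of the paper.

```python
import numpy as np

def dQ_dsigma(F, X, sigma):
    """Gradient of the smoothness term of Q w.r.t. sigma, following Eqs. (3)-(5).
    F is the current n x c label matrix; the fitting term does not depend on sigma."""
    sq_norms = np.sum(X ** 2, axis=1)
    d2 = np.maximum(sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T, 0.0)
    W = np.exp(-d2 / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    D = W.sum(axis=1)                              # degrees D_ii

    dW = d2 * W / sigma ** 3                       # Eq. (4)
    dD = dW.sum(axis=1)                            # Eq. (5)

    Fn = F / np.sqrt(D)[:, None]                   # rows F_i / sqrt(D_ii)
    diff = Fn[:, None, :] - Fn[None, :, :]         # (n, n, c): F_i/sqrt(D_ii) - F_j/sqrt(D_jj)

    term1 = dW * np.sum(diff ** 2, axis=2)         # (dW_ij/dsigma) * ||.||^2

    G = (dD / D ** 1.5)[:, None] * F               # rows F_i * dD_ii/dsigma / D_ii^{3/2}
    inner = G[:, None, :] - G[None, :, :]          # (n, n, c)
    term2 = W * np.sum(diff * inner, axis=2)       # W_ij * <diff, inner>

    return np.sum(term1 - term2)
```

In practice W and D could be cached between the F-update and the σ-update, since both quantities are needed in every ISM iteration.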

3.2. Learning Rate Selection in Gradient Descent

In gradient descent, the hyperparameter is updated as σ_new = σ_old − η ∂Q/∂σ|_{σ=σ_old}, and the learning rate η strongly affects the performance of gradient descent. In ISM, η is selected dynamically to accelerate the convergence of the algorithm.

If we fix F, the cost function Q varies only with σ. We therefore plot the cost function curve in Fig. 2, in which the current value of σ is denoted by point A. In the case where the cost function has multiple local minima, this figure can be regarded as a local region of the cost function curve.

Fig. 2. Possible regions where σ might land during an iteration, where B represents the region between A and the minimum point O. Regions C and D are separated by the point A* satisfying Q(σ_{A*}) = Q(σ_A).

After updating σ with gradient descent, its new value σ̂_1 might appear in three regions: B, C or D. In ISM, we want the cost function to decrease monotonically to guarantee the convergence of the algorithm. Also, to simplify our method, we want the value of σ to avoid oscillation. Hence, B is the ideal region for σ̂_1, and σ̂_1 should be near the minimum point O. If σ̂_1 appears in region C or D, then the learning rate is too large and we set η equal to η_s, a value small enough to guarantee that, with η_s, the new variance σ appears in region B. On the other hand, if σ̂_1 appears in region B, we increase the learning rate by doubling it and calculate σ̂_2 with the new η. If σ̂_2 appears in region C or D, we resume the original learning rate and output σ̂_1 as the variance in this iteration; otherwise, we output σ̂_2.

3.3. Implementation of Iterative Smoothness Maximization

The implementation details of the ISM algorithm are as follows:

1. Initialization. Set σ = σ_0, the total number of iteration steps N_0, the initial learning rate η_0 and the small learning rate η_s.²

2. Calculate the optimal F. ∂Q/∂F = 0 ⇒ F_{n+1} = (1 − α)(I − αS(σ_n))^{-1} T, where S = D^{-1/2} W D^{-1/2} and α = 1/(1 + µ).³ [2]

3. Update σ with gradient descent and adjust the learning rate η: σ̂_1 = σ_n − η ∂Q/∂σ|_{σ=σ_n}.

   (a) If Q(F_{n+1}, σ̂_1) < Q(F_{n+1}, σ_n), double the learning rate and compute σ̂_2 = σ_n − η ∂Q/∂σ|_{σ=σ_n} with the new η:

       i. If Q(F_{n+1}, σ̂_2) < Q(F_{n+1}, σ_n) and sgn(∂Q/∂σ|_{σ=σ̂_2}) = sgn(∂Q/∂σ|_{σ=σ_n}), then σ_{n+1} = σ̂_2;

       ii. Else σ_{n+1} = σ̂_1, η = η/2.

   (b) Else η = η_s, σ_{n+1} = σ_n.

4. If n > N_0, quit the iteration and output the classification result; else, go to step 2.

² In order to guarantee the convergence of the algorithm, η_s is set to be close to 0.
³ The parameter α used in our method is simply fixed at 0.99 [2].
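Putting steps 1-4 together, here is a minimal Python sketch of the ISM loop. It reuses the gaussian_affinity and dQ_dsigma helpers sketched above; the function signature, the default values, and the use of the trace form of Q from Section 4 for the comparisons are our own illustrative choices rather than the authors' reference implementation.

```python
import numpy as np

def ism(X, T, sigma0=1.0, eta0=0.1, eta_s=1e-4, N0=100, alpha=0.99):
    """Sketch of Iterative Smoothness Maximization (Section 3.3).
    X: (n, d) data matrix; T: (n, c) initial label matrix. Returns (F, sigma)."""
    n = X.shape[0]
    mu = 1.0 / alpha - 1.0                         # alpha = 1 / (1 + mu)
    sigma, eta = sigma0, eta0

    def normalized_similarity(sig):
        W = gaussian_affinity(X, sig)              # Eq. (1), zero diagonal
        D = W.sum(axis=1)
        return W / np.sqrt(np.outer(D, D))         # S = D^{-1/2} W D^{-1/2}

    def cost(F, sig):
        # Q in the form used in Section 4: F^T (I - S) F + mu * ||F - T||^2.
        S = normalized_similarity(sig)
        return np.trace(F.T @ (np.eye(n) - S) @ F) + mu * np.sum((F - T) ** 2)

    for _ in range(N0):
        # Step 2: closed-form minimizer of Q over F with sigma fixed.
        S = normalized_similarity(sigma)
        F = (1.0 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, T)

        # Step 3: gradient step on sigma with the adaptive learning rate of Section 3.2.
        g = dQ_dsigma(F, X, sigma)
        s1 = sigma - eta * g
        if cost(F, s1) < cost(F, sigma):           # the step decreased the cost
            eta *= 2.0                             # try a bolder step (sigma_hat_2)
            s2 = sigma - eta * g
            if cost(F, s2) < cost(F, sigma) and np.sign(dQ_dsigma(F, X, s2)) == np.sign(g):
                sigma = s2
            else:
                eta /= 2.0                         # resume the previous learning rate
                sigma = s1
        else:                                      # cost increased: the step was too large
            eta = eta_s                            # fall back to the safe small rate
    return F, sigma
```

The predicted label of x_i is then read off as t_i = arg max_j F_ij, exactly as defined in Section 2.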

4. CONVERGENCE STUDY OF ITERATIVE SMOOTHNESS MAXIMIZATION

Since the algorithm proposed here is iterative, it is crucial to study its convergence property. Without loss of generality, we only consider binary classification here; the analysis can easily be extended to the multi-class case. The cost function can be written as Q(f, σ) = f^T (I − S) f + µ (f − t)^T (f − t). Then

$$\frac{\partial Q}{\partial f} = 0 \;\Leftrightarrow\; f^* = (1-\alpha)(I-\alpha S)^{-1} t \qquad (6)$$

while the Hessian matrix is

$$H = \frac{\partial^2 Q}{\partial f^2} = I - S + \mu I \qquad (7)$$

For all x ∈ R^n with x ≠ 0,

$$x^T H x = x^T (I - S) x + \mu x^T x = \sum_{i,j=1}^{n} W_{ij}\left(\frac{x_i}{\sqrt{D_{ii}}}-\frac{x_j}{\sqrt{D_{jj}}}\right)^2 + \mu\sum_{i=1}^{n} x_i^2 > 0 \qquad (8)$$

Hence the Hessian matrix is positive definite; moreover, f* is the unique zero point of ∂Q/∂f. Therefore, f* is the global minimum of Q with σ fixed. The fact that f* is the global minimum of Q leads to the inequality Q(f_{n+1}, σ_n) < Q(f_n, σ_n), where f_{n+1} = f*. In step 3 of ISM, σ_{n+1} is calculated as σ̂_{n+1} = σ_n − η ∂Q/∂σ|_{σ_n}, with the learning rate in gradient descent selected to guarantee that Q(f_{n+1}, σ_{n+1}) < Q(f_{n+1}, σ_n). Thus, in every iteration of ISM, the following inequality holds:

$$Q(f_{n+1}, \sigma_{n+1}) < Q(f_{n+1}, \sigma_n) < Q(f_n, \sigma_n) \qquad (9)$$

which implies that the cost function Q(σ, f) decreases monotonically. Moreover, since Q(σ, f) > 0, Q is lower bounded by zero. Hence, the algorithm is guaranteed to converge. In practice, our method converges after 29 iterations on the two-moon data set with σ_0 = 1.

5. EXPERIMENTS

We validate our method on two real-world data sets in this section.

5.1. Digit Recognition

In this experiment, we focus on classifying handwritten digits. We use images of the digits 1, 2, 3 and 4 from the USPS⁴ data set of 16 × 16 handwritten digits; the four classes contain 1005, 731, 658 and 652 samples respectively, 3046 in total. We employ the Nearest Neighbor (NN) classifier and SVM [9] as baselines.

For comparison, we also provide the classification results of the LLGC method [2], in which the affinity matrix is constructed by a Gaussian function with variance 1.25, tuned by grid search. The number of labeled samples increases from 4 to 50, and the test errors averaged over 50 random trials are summarized in Fig. 3(b), from which we can clearly see the advantages of ISM and LLGC. Moreover, the hyperparameter found for LLGC by grid search is clearly suboptimal.

Fig. 3. Classification results on USPS data. (a) Changes of σ during the iteration. (b) Recognition accuracies, with the horizontal axis representing the number of randomly labeled samples.

⁴ http://www.kernel-machines.org/data.html
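As a purely hypothetical sketch of the evaluation protocol described above (random labeled subsets, 50 trials, transductive accuracy on the remaining points), the snippet below reuses the ism sketch from Section 3.3; the function name evaluate and all default values are ours.

```python
import numpy as np

def evaluate(X, y, num_labeled_list, trials=50, seed=0):
    """Hypothetical evaluation protocol: for each number of labeled samples,
    average transductive accuracy over `trials` random labelings."""
    rng = np.random.default_rng(seed)
    classes = np.unique(y)                              # sorted class labels
    results = {}
    for n_lab in num_labeled_list:                      # assumes n_lab >= len(classes)
        accs = []
        for _ in range(trials):
            # Redraw until every class has at least one labeled sample.
            while True:
                lab = rng.choice(len(y), size=n_lab, replace=False)
                if len(np.unique(y[lab])) == len(classes):
                    break
            T = np.zeros((len(y), len(classes)))
            T[lab, np.searchsorted(classes, y[lab])] = 1.0
            F, _ = ism(X, T)                            # ISM sketch from Section 3.3
            pred = classes[np.argmax(F, axis=1)]
            unlabeled = np.setdiff1d(np.arange(len(y)), lab)
            accs.append(np.mean(pred[unlabeled] == y[unlabeled]))
        results[n_lab] = float(np.mean(accs))
    return results
```

Under these assumptions, the USPS experiment would correspond to a call like evaluate(X, y, [4, 10, 20, 30, 40, 50]).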

5.2. Object Recognition

In this section, we address the task of object recognition using the COIL-20⁵ data set, a database of gray-scale images of 20 objects, with 72 images of size 128 × 128 per object. As in the digit recognition experiment, we employ the Nearest Neighbor classifier and SVM as baselines. For comparison, we also test the recognition accuracy of the LLGC method, in which the affinity matrix is constructed by a Gaussian function with variance 40, again tuned by grid search. The number of labeled samples per class increases from 2 to 18, and the test errors averaged over 50 random trials are summarized in Fig. 4(b), from which we can clearly see the advantages of ISM and LLGC, while in ISM the hyperparameter σ is determined automatically.

Fig. 4. Object recognition results on COIL-20 data. (a) Changes of σ during the iteration. (b) Recognition accuracies, with the horizontal axis representing the number of randomly labeled samples per class.

⁵ http://www1.cs.columbia.edu/CAVE/software/softlib/coil-100.php

6. CONCLUSIONS AND FUTURE WORKS

Employing the idea of gradient descent, we have proposed the Iterative Smoothness Maximization algorithm to learn an optimal graph automatically for a semi-supervised learning task. Moreover, the algorithm is guaranteed to converge. Experimental results on both toy and real-world data sets show the effectiveness of ISM for parameter selection when only a few labeled samples are provided. Although we focus on learning the hyperparameter σ of the Gaussian function, our method can be applied to other kernel functions as well. Beyond these advantages of ISM, there are still aspects that deserve further research to improve the efficiency of our method, including employing more advanced optimization algorithms to speed up convergence and extending ISM to allow a different σ along each direction.

7. ACKNOWLEDGEMENT

This work is supported by the project (60675009) of the National Natural Science Foundation of China.

8. REFERENCES

[1] X. Zhu, "Semi-supervised learning literature survey," Computer Sciences Technical Report 1530, University of Wisconsin-Madison, 2006.
[2] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Scholkopf, "Learning with local and global consistency," Advances in Neural Information Processing Systems, 2003.
[3] M. Belkin, P. Niyogi, and V. Sindhwani, "On manifold regularization," Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005.
[4] O. Chapelle, B. Scholkopf, and A. Zien, Semi-Supervised Learning, MIT Press, Cambridge, MA, 2006.
[5] O. Chapelle, J. Weston, and B. Scholkopf, "Cluster kernels for semi-supervised learning," Advances in Neural Information Processing Systems 15, 2003.
[6] X. Zhu, Semi-Supervised Learning with Graphs, Doctoral thesis, Carnegie Mellon University, May 2005.
[7] O. Delalleau, Y. Bengio, and N. Le Roux, "Non-parametric function induction in semi-supervised learning," Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005.
[8] L. Zelnik-Manor and P. Perona, "Self-tuning spectral clustering," Advances in Neural Information Processing Systems, 2004.
[9] B. Scholkopf and A. Smola, Learning with Kernels, MIT Press, Cambridge, MA, 2006.
