Smoothness Maximization via Gradient Descents

Bin Zhao, Fei Wang, Changshui Zhang
Department of Automation, Tsinghua University, Beijing 100084, P.R. China
Email: [email protected], [email protected], [email protected]
Semi-Supervised Learning

• Learning from partially labeled data
• Transductive Learning
• Inductive Learning

Graph Based Semi-Supervised Learning

Represent the dataset as a weighted undirected graph $G = \langle V, E \rangle$
• V: node set, corresponding to the labeled and unlabeled examples
• E: edge set, each edge $e_{ij} \in E$ associated with a weight
$$W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right) \qquad (1)$$
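For concreteness, a minimal NumPy sketch of the graph construction in Eq. (1). The function name build_graph, the fully connected graph, and the zero-diagonal (no self-loop) convention are assumptions of this example, not details stated on the poster:

```python
import numpy as np

def build_graph(X, sigma):
    """Construct the weight matrix W of Eq. (1) and the degree matrix D.

    X: (n, d) array of labeled and unlabeled examples (the graph nodes).
    sigma: width of the Gaussian edge weights.
    """
    # Pairwise squared Euclidean distances d_ij^2 = ||x_i - x_j||^2.
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    # Gaussian edge weights, Eq. (1); the graph is fully connected here.
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)          # no self-loops (an assumption)
    D = np.diag(W.sum(axis=1))        # D_ii = sum_j W_ij
    return W, D
```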
Graph Construction

• Graph construction is at the heart of GBSSL: the final classification result is significantly affected by σ
• So far there is no reliable approach that can determine an optimal σ automatically

The Influence of σ

[Figure: toy data (two-moon). (a) Unlabeled points and the two labeled points (+1 and −1); (b) classification result with σ = 0.15; (c) classification result with σ = 0.4.]
Objective Function

• $T_{ij} = 1$ if $x_i$ is labeled as $t_i = j$, and $T_{ij} = 0$ otherwise
• The classification result is represented as $F = [F_1^T, \ldots, F_n^T]^T$, which determines the label of $x_i$ by $t_i = \arg\max_{1 \le j \le c} F_{ij}$
• $W$ is the weight matrix, and $D_{ii} = \sum_j W_{ij}$

$$Q(F, \sigma) = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2 + \mu \sum_{i=1}^{n} \|F_i - T_i\|^2 \qquad (2)$$

• The first term measures label smoothness
• The second term measures label fitness
• $\mu > 0$ adjusts the tradeoff between these two terms
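A direct NumPy transcription of Eq. (2); the function name objective_Q is an illustrative choice, and W is assumed precomputed as in the build_graph sketch above:

```python
import numpy as np

def objective_Q(F, T, W, mu):
    """Evaluate the objective function of Eq. (2).

    F: (n, c) soft classification matrix, T: (n, c) label matrix,
    W: (n, n) weight matrix from Eq. (1), mu: tradeoff parameter.
    """
    d = W.sum(axis=1)                       # degrees D_ii
    G = F / np.sqrt(d)[:, None]             # rows F_i / sqrt(D_ii)
    diffs = G[:, None, :] - G[None, :, :]   # F_i/sqrt(D_ii) - F_j/sqrt(D_jj)
    smoothness = 0.5 * np.sum(W * np.sum(diffs ** 2, axis=-1))
    fitness = mu * np.sum((F - T) ** 2)
    return smoothness + fitness
```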
Iterative Smoothness Maximization

• Smoothness maximization: optimize $Q$ w.r.t. $F$ with $\sigma$ fixed,
$$\frac{\partial Q}{\partial F} = 0 \;\Rightarrow\; F^* = (1 - \alpha)(I - \alpha S)^{-1} T \qquad (3)$$
where $S = D^{-1/2} W D^{-1/2}$ and $\alpha = 1/(1 + \mu)$
• Graph reconstruction: optimize the objective function $Q$ w.r.t. $\sigma$ via gradient descent
• Iterate between the above two steps until convergence
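A minimal sketch of the closed-form smoothness-maximization step, Eq. (3). The function name optimal_F is an assumption, and a linear solve is used in place of the explicit matrix inverse:

```python
import numpy as np

def optimal_F(W, T, mu):
    """Closed-form smoothness maximization step, Eq. (3).

    W: (n, n) weight matrix, T: (n, c) label matrix (rows of zeros
    for unlabeled points), mu: tradeoff parameter of Eq. (2).
    """
    alpha = 1.0 / (1.0 + mu)
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))   # S = D^{-1/2} W D^{-1/2}
    n = W.shape[0]
    # F* = (1 - alpha) (I - alpha S)^{-1} T, solved as a linear system.
    return (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, T)

# Predicted label of x_i: t_i = argmax_j F_ij
# labels = optimal_F(W, T, mu).argmax(axis=1)
```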
Gradient Computing

Compute the gradient of $Q(F, \sigma)$ w.r.t. $\sigma$, with $F$ fixed as $F^* = \arg\min_F Q$:

$$\frac{\partial Q(F, \sigma)}{\partial \sigma} = \frac{1}{2} \sum_{i,j=1}^{n} \left\{ \frac{\partial W_{ij}}{\partial \sigma} \left\| \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right\|^2 - W_{ij} \left( \frac{F_i}{\sqrt{D_{ii}}} - \frac{F_j}{\sqrt{D_{jj}}} \right)^T \left( \frac{\partial D_{ii}}{\partial \sigma} \frac{F_i}{\sqrt{D_{ii}^3}} - \frac{\partial D_{jj}}{\partial \sigma} \frac{F_j}{\sqrt{D_{jj}^3}} \right) \right\} \qquad (4)$$

where

$$\frac{\partial W_{ij}}{\partial \sigma} = \frac{\partial \exp\left(-\frac{d_{ij}^2}{2\sigma^2}\right)}{\partial \sigma} = \frac{d_{ij}^2}{\sigma^3} \exp\left(-\frac{d_{ij}^2}{2\sigma^2}\right) \qquad (5)$$

$$\frac{\partial D_{ii}}{\partial \sigma} = \sum_j \frac{\partial W_{ij}}{\partial \sigma} = \sum_j \frac{d_{ij}^2 W_{ij}}{\sigma^3} \qquad (6)$$
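A NumPy sketch of Eqs. (4)–(6); the function name grad_Q_sigma is an assumption, and the squared distances $d_{ij}^2$ are taken as precomputed. A finite-difference check against objective_Q above is a useful sanity test of the implementation:

```python
import numpy as np

def grad_Q_sigma(F, W, sq_dists, sigma):
    """Gradient of Q w.r.t. sigma, Eqs. (4)-(6), with F held fixed.

    F: (n, c) classification matrix, W: (n, n) weights from Eq. (1),
    sq_dists: (n, n) matrix of d_ij^2, sigma: current kernel width.
    """
    dW = sq_dists * W / sigma ** 3          # Eq. (5): dW_ij/dsigma
    dD = dW.sum(axis=1)                     # Eq. (6): dD_ii/dsigma
    d = W.sum(axis=1)                       # degrees D_ii
    G = F / np.sqrt(d)[:, None]             # g_i = F_i / sqrt(D_ii)
    diffs = G[:, None, :] - G[None, :, :]   # g_i - g_j, shape (n, n, c)
    # First term of Eq. (4): dW_ij/dsigma * ||g_i - g_j||^2
    term1 = dW * np.sum(diffs ** 2, axis=-1)
    # Inner factor of the second term: dD_ii/dsigma * F_i / sqrt(D_ii^3)
    H = (dD / d ** 1.5)[:, None] * F        # shape (n, c)
    inner = H[:, None, :] - H[None, :, :]   # (n, n, c)
    term2 = W * np.sum(diffs * inner, axis=-1)
    return 0.5 * np.sum(term1 - term2)
```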
Convergence Study

• Smoothness maximization: the Hessian of $Q$ w.r.t. $F$ with $\sigma$ fixed is

$$H = \frac{\partial^2 Q}{\partial F^2} = (I - S) + \mu I \qquad (7)$$

For all $x \in \mathbb{R}^n$, $x \neq 0$,

$$x^T H x = x^T (I - S) x + \mu x^T x = \frac{1}{2} \sum_{i,j=1}^{n} W_{ij} \left( \frac{x_i}{\sqrt{D_{ii}}} - \frac{x_j}{\sqrt{D_{jj}}} \right)^2 + \mu \sum_{i=1}^{n} x_i^2 > 0 \qquad (8)$$

Hence the Hessian matrix is positive definite; moreover, $F^*$ is the unique zero point of $\partial Q / \partial F$. Therefore $F^*$ is the global minimum of $Q$ with $\sigma$ fixed, so

$$Q(F_{n+1}, \sigma_n) < Q(F_n, \sigma_n) \qquad (9)$$

• Graph reconstruction: the selection of the learning rate $\eta$ in gradient descent (see Learning Rate Selection below) guarantees

$$Q(F_{n+1}, \sigma_{n+1}) < Q(F_{n+1}, \sigma_n) \qquad (10)$$

Hence the objective function $Q$ decreases monotonically and is lower bounded by 0, so the algorithm is guaranteed to converge.
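As a quick numeric illustration of Eq. (8), the following check (on random toy data, an assumption of this example) verifies that the smallest eigenvalue of $H = (I - S) + \mu I$ is positive:

```python
import numpy as np

# Sanity check of Eq. (8): H = (I - S) + mu*I should be positive definite.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))                  # random toy data (assumption)
sigma, mu = 0.5, 0.1

sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
W = np.exp(-sq / (2 * sigma ** 2))
np.fill_diagonal(W, 0.0)
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))               # S = D^{-1/2} W D^{-1/2}
H = np.eye(len(X)) - S + mu * np.eye(len(X))

print(np.linalg.eigvalsh(H).min())            # > 0, consistent with Eq. (8)
```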
Learning Rate Selection
• $\sigma$ is updated as $\sigma_{\mathrm{new}} = \sigma_{\mathrm{old}} - \eta \left. \frac{\partial Q}{\partial \sigma} \right|_{\sigma = \sigma_{\mathrm{old}}}$
• The learning rate $\eta$ severely affects the performance of gradient descent
• In ISM, $\eta$ is selected dynamically to accelerate convergence. The objective function should decrease monotonically to guarantee convergence and, to keep the method simple, the value of $\sigma$ should avoid oscillation; the selection rule is formalized in the implementation below.
Implementation of Iterative Smoothness Maximization
1. Initialization: set $\sigma = \sigma_0$, the total number of iteration steps $N_0$, the initial learning rate $\eta_0$, and a small learning rate $\eta_s$.
2. Calculate the optimal $F$ via Eq. (3).
3. Update $\sigma$ with gradient descent and adjust the learning rate $\eta$: compute $\hat{\sigma}_1 = \sigma_n - \eta \left. \frac{\partial Q}{\partial \sigma} \right|_{\sigma = \sigma_n}$.
   (a) If $Q(F_{n+1}, \hat{\sigma}_1) < Q(F_{n+1}, \sigma_n)$ and $\mathrm{sgn}\big(\left.\frac{\partial Q}{\partial \sigma}\right|_{\sigma = \hat{\sigma}_1}\big) = \mathrm{sgn}\big(\left.\frac{\partial Q}{\partial \sigma}\right|_{\sigma = \sigma_n}\big)$, then set $\eta = 2\eta$ and $\hat{\sigma}_2 = \sigma_n - \eta \left. \frac{\partial Q}{\partial \sigma} \right|_{\sigma = \sigma_n}$.
       i. If $Q(F_{n+1}, \hat{\sigma}_2) < Q(F_{n+1}, \sigma_n)$ and $\mathrm{sgn}\big(\left.\frac{\partial Q}{\partial \sigma}\right|_{\sigma = \hat{\sigma}_2}\big) = \mathrm{sgn}\big(\left.\frac{\partial Q}{\partial \sigma}\right|_{\sigma = \sigma_n}\big)$, then $\sigma_{n+1} = \hat{\sigma}_2$.
       ii. Else $\sigma_{n+1} = \hat{\sigma}_1$ and $\eta = \eta/2$.
   (b) Else $\eta = \eta_s$ and $\sigma_{n+1} = \sigma_n$.
4. If $n > N_0$, quit the iteration and output the classification result; else, go to step 2.
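Putting the pieces together, a compact sketch of the full ISM loop implementing steps 1–4 above. It reuses the build_graph, objective_Q, and grad_Q_sigma sketches from the earlier panels, and the default parameter values are placeholders, not the paper's settings:

```python
import numpy as np

def ism(X, T, sigma0=0.3, mu=0.1, N0=50, eta0=0.05, eta_s=1e-4):
    """Iterative Smoothness Maximization, steps 1-4 (a sketch)."""
    sq = np.sum((X[:, None] - X[None, :]) ** 2, axis=-1)
    sigma, eta, alpha = sigma0, eta0, 1.0 / (1.0 + mu)
    n = len(X)
    for _ in range(N0):
        # Step 2: optimal F for the current graph, Eq. (3).
        W, _ = build_graph(X, sigma)
        d = W.sum(axis=1)
        S = W / np.sqrt(np.outer(d, d))
        F = (1 - alpha) * np.linalg.solve(np.eye(n) - alpha * S, T)

        # Step 3: gradient step on sigma with dynamic learning rate.
        def q(s):
            Ws, _ = build_graph(X, s)
            return objective_Q(F, T, Ws, mu)

        def g(s):
            Ws, _ = build_graph(X, s)
            return grad_Q_sigma(F, Ws, sq, s)

        g0 = g(sigma)
        s1 = sigma - eta * g0
        if q(s1) < q(sigma) and np.sign(g(s1)) == np.sign(g0):
            eta *= 2                       # (a): accept, try a bolder step
            s2 = sigma - eta * g0
            if q(s2) < q(sigma) and np.sign(g(s2)) == np.sign(g0):
                sigma = s2                 # (a)-i: keep the bolder step
            else:
                sigma, eta = s1, eta / 2   # (a)-ii: keep the first step
        else:
            eta = eta_s                    # (b): fall back to the small rate
    return F, sigma
```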
Digit Recognition

[Figure: digit recognition on USPS. (a) $\sigma$ vs. step of iteration; (b) recognition accuracy of ISM compared with NN, LLGC, and SVM.]

Object Recognition

[Figure: object recognition on COIL−20. (a) $\sigma$ vs. step of iteration; (b) recognition accuracy of ISM compared with NN, LLGC, and SVM.]

References
• D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. Advances in Neural Information Processing Systems, 2003.
• M. Belkin, P. Niyogi, and V. Sindhwani. On manifold regularization. Proceedings of the 10th International Workshop on Artificial Intelligence and Statistics, 2005.
• X. Zhu. Semi-supervised learning literature survey. Computer Sciences Technical Report 1530, University of Wisconsin-Madison, 2006.
• L. Zelnik-Manor and P. Perona. Self-tuning spectral clustering. Advances in Neural Information Processing Systems, 2004.