Improved Deep Metric Learning with Multi-class N-pair Loss Objective
Kihyuk Sohn1 and Wendy Shang2
1NEC Labs, 2Oculus VR
Overview
[Figure: a DNN maps input examples to embedding vectors f, f+, f1, ..., fN-1+, on which the distance metric is learned.]
N-pair loss: deep metric learning with multiple negatives
q Deep metric learning: learning a distance metric via deep learning.
q Existing frameworks utilize only weak supervision (e.g., same / not-same pair labels).
[Figure: tuple construction for (a) triplet, (b) (N+1)-tuplet, and (c) N-pair losses.]
q Losses in existing frameworks:
§ contrastive loss [1]
§ triplet loss [2,3]
§ Both require carefully designed negative data mining, which is expensive for deep networks.
q Learning to identify from multiple negative examples
§ The (N+1)-tuplet loss identifies a positive example from N negative examples.
q We propose a novel deep metric learning framework
§ that allows joint comparison with multiple negatives,
§ while reducing the computational burden via efficient batch construction.
q We demonstrate the superiority of the proposed loss over the triplet loss on several visual recognition benchmarks, including fine-grained object recognition and verification, image clustering and retrieval, and face verification and identification.
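The (N+1)-tuplet identification idea above can be sketched in NumPy (a minimal sketch assuming inner-product similarity between L2-regularized embeddings; the function name is illustrative, not from the poster):

```python
import numpy as np

def n_plus_one_tuplet_loss(f, f_pos, f_negs):
    """(N+1)-tuplet loss: softmax-style identification of the single
    positive f_pos among N negatives, given an anchor embedding f.
    f, f_pos: (d,) arrays; f_negs: (N, d) array."""
    pos_sim = f @ f_pos                  # similarity to the positive
    neg_sims = f_negs @ f                # similarities to the N negatives
    # log(1 + sum_i exp(f.f_i^- - f.f^+)): small when the positive wins
    return np.log1p(np.exp(neg_sims - pos_sim).sum())
```

The triplet loss corresponds to the special case of a single negative (N=1 in this parameterization).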
Experimental results 3-2: Face verification and identification
§ Triplet loss is a special case of the (N+1)-tuplet loss when N=2.
q Efficient batch construction via N-pair examples: O(N²) → O(N)
§ N tuples for the (N+1)-tuplet loss require N(N+1) examples to be evaluated.
§ We can instead obtain N tuples of the (N+1)-tuplet loss by constructing a batch from N pairs drawn from N different classes. This requires only 2N examples to be evaluated.
q Multi-class N-pair (N-pair-mc) loss and one-vs-one N-pair loss (N-pair-ovo)
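The batch-construction trick can be sketched as follows (a NumPy sketch assuming inner-product similarity; the function name is illustrative): from N (anchor, positive) pairs drawn from N distinct classes, each positive serves as a shared negative for every other anchor, so N tuples are scored from only 2N embeddings:

```python
import numpy as np

def n_pair_mc_loss(anchors, positives):
    """Multi-class N-pair loss on an N-pair batch of 2N embeddings.
    anchors, positives: (N, d); positives[i] matches anchors[i] and acts
    as a negative for every anchors[j], j != i."""
    logits = anchors @ positives.T           # (N, N) similarity matrix
    pos = np.diag(logits)                    # f_i . f_i^+
    diffs = np.exp(logits - pos[:, None])    # exp(f_i.f_j^+ - f_i.f_i^+)
    np.fill_diagonal(diffs, 0.0)             # exclude the positive itself
    # average of log(1 + sum_{j != i} exp(...)) over the N tuples
    return np.mean(np.log1p(diffs.sum(axis=1)))
```

Note how a single N×N matrix product scores all N tuples at once, which is where the O(N²) → O(N) saving in evaluated examples comes from.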
q Experimental setting
§ Dataset: train on the CASIA WebFace database [9] (500k images of 10k identities); evaluate on the LFW database [10].
§ Training:
1. CASIA network trained from scratch (10 conv. layers + 4 max-pooling and 1 average-pooling layers).
2. 384 examples per batch, which corresponds to a 192-pair loss.
3. For the 320- and 720-pair loss models, the batch size is increased accordingly.
4. We replace conv + max pooling with strided convolution [11] to reduce GPU memory usage for the 720-pair loss model.
§ Testing:
1. Verification: determine whether two images show the same person.
2. Closed-set identification (rank-1): the reference identity exists in the gallery.
3. Open-set identification (DIR@FAR=1%): the reference identity may not exist in the gallery.
q Performance
§ N-pair-mc loss models consistently improve over triplet and N-pair-ovo loss models.
§ The performance gap becomes more significant on identification tasks.
§ Increasing N by increasing the batch size is important.
References
[1] Chopra et al. Learning a similarity metric discriminatively, with application to face verification. In CVPR, 2005.
[2] Weinberger et al. Distance metric learning for large margin nearest neighbor classification. In NIPS, 2005.
[3] Schroff et al. FaceNet: A unified embedding for face recognition and clustering. In CVPR, 2015.
[4] Xie et al. Hyper-class augmented and regularized deep learning for fine-grained image classification. In CVPR, 2015.
[5] Szegedy et al. Going deeper with convolutions. In CVPR, 2015.
[6] Song et al. Deep metric learning via lifted structured feature embedding. In CVPR, 2016.
[7] Krause et al. 3D object representations for fine-grained categorization. 2013.
[8] Wah et al. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[9] Yi et al. Learning face representation from scratch. CoRR, abs/1411.7923, 2014.
[10] Huang et al. Towards unconstrained face recognition. In CVPR Workshop, 2008.
[11] Springenberg et al. Striving for simplicity: The all convolutional net. In ICLR Workshop, 2015.
q Convergence analysis
q Hard negative class mining
§ When the output space is small, the N-pair loss does not require hard negative data mining.
§ When the output space is large, we propose to find hard negative "classes".
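A greedy sketch of the hard-negative-class idea (illustrative only; the poster does not spell out the exact procedure, and the use of per-class mean embeddings as a confusability proxy is an assumption):

```python
import numpy as np

def mine_hard_negative_classes(class_means, n_select, seed_class=0):
    """Greedily select n_select mutually similar classes to form an
    N-pair batch from confusable classes.
    class_means: (C, d) mean embedding per class under the current model."""
    chosen = [seed_class]
    while len(chosen) < n_select:
        sims = class_means @ class_means[chosen].T   # (C, len(chosen))
        sims[chosen] = -np.inf                       # never re-pick a class
        # add the class most similar to any already-chosen class
        chosen.append(int(sims.max(axis=1).argmax()))
    return chosen
```

Each selected class then contributes one (anchor, positive) pair to the N-pair batch, so negatives within the batch come from classes the current model already confuses.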
Experimental results 3-1: Fine-grained visual object recognition & verification
q Experimental setting
§ Datasets: Car-333 [4] (165k images of 333 car models) and Flower-610 (62k images of 610 flower species).
§ Training:
1. Initialized from ImageNet-pretrained GoogLeNet [5].
2. 144 examples per batch, which corresponds to a 72-pair loss.
§ Testing:
1. Classification: k-nearest-neighbor classifier using cosine similarity.
2. Verification: rank-1 accuracy of a query against a single positive and multiple negatives.
q Performance
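The classification protocol (k-nearest-neighbor with cosine similarity) can be sketched as follows (k and the majority-vote tie-breaking are assumptions; the poster does not specify them):

```python
import numpy as np

def knn_cosine_predict(ref_emb, ref_labels, query_emb, k=5):
    """Predict query labels by majority vote over the k reference
    embeddings with the highest cosine similarity."""
    r = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    topk = np.argsort(-(q @ r.T), axis=1)[:, :k]   # k nearest per query
    preds = []
    for neighbor_labels in ref_labels[topk]:        # majority vote
        vals, counts = np.unique(neighbor_labels, return_counts=True)
        preds.append(vals[counts.argmax()])
    return np.array(preds)
```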
Visual recognition of unseen object classes
q Experimental setting
§ Datasets: Online Products [6] (120k images of 23k online product categories), Car-196 [7] (16k images of 196 car models), and CUB-200 [8] (12k images of 200 bird species).
§ Procedure: split each dataset by object category, i.e., no overlapping categories between train and test.
§ Training:
1. Initialized from ImageNet-pretrained GoogLeNet.
2. 120 examples per batch, which corresponds to a 60-pair loss.
§ Testing:
1. Clustering: F1 and NMI (normalized mutual information) scores.
2. Retrieval: recall@K at different values of K.
q Performance
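For reference, the retrieval metric can be computed as follows (a sketch; cosine similarity as the retrieval metric is an assumption):

```python
import numpy as np

def recall_at_k(embeddings, labels, k):
    """Recall@K: fraction of queries for which at least one of the K
    nearest neighbors (self excluded) shares the query's label."""
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = x @ x.T
    np.fill_diagonal(sims, -np.inf)          # exclude self-retrieval
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (labels[topk] == labels[:, None]).any(axis=1)
    return float(hits.mean())
```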
Training accuracy (left) and loss (right) curves of triplet, 192-pair-ovo, and 192-pair-mc loss models. Triplet and 192-way classification accuracy and loss are plotted.
§ The difference in learning curves becomes apparent under the 192-way classification measure.
§ The 192-pair-mc loss model reaches the final accuracy of the triplet loss model after 15k updates, and that of the 192-pair-ovo loss model after 20k updates.
q Importance of N-pair batch construction
§ N×M batch construction, where N is the number of distinct classes and M is the number of examples per class (e.g., N-pair = N×2).
§ NCA-like loss function:
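The referenced NCA-like equation did not survive extraction. The following is only a sketch of the general NCA-style form for an N×M batch (each example should put its softmax probability mass on its M−1 same-class batch mates), not necessarily the poster's exact objective:

```python
import numpy as np

def nca_like_loss(emb, labels):
    """NCA-style loss over an NxM batch: -log of the softmax probability
    mass each example assigns to its same-class batch mates.
    emb: (N*M, d); labels: (N*M,) with M examples per class."""
    logits = emb @ emb.T
    np.fill_diagonal(logits, -np.inf)            # exclude self
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)    # row-wise softmax
    same = labels[:, None] == labels[None, :]
    np.fill_diagonal(same, False)
    pos_mass = (probs * same).sum(axis=1)
    return float(-np.log(pos_mass).mean())
```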
§ 200-dim embedding vectors are learned on top of the pool5 features of GoogLeNet. An additional fully-connected layer is added for training softmax loss models.
§ "Softmax": recognition with the softmax classifier.
§ 72-pair loss models improve over triplet models, even those with negative data mining.
§ 72-pair-mc loss models improve upon 72-pair-ovo loss models.
§ Softmax loss models are good at classification but poor at verification.
§ Performance: as in the previous experiments, 60-pair loss models consistently improve over triplet models, and multi-class loss models outperform one-vs-one loss models.
§ Negative class mining is effective on the Online Products dataset, where the number of classes (>11k) is significantly larger than N (=60).
§ The N-pair loss achieves the best performance among the variants.
§ Although there is a performance drop, N×M loss models are still significantly better than triplet loss models.