Sparse distance metric learning for embedding compositional data

Zachary D. Kurtz (zachary.kurtz@med.nyu.edu)
Sackler Institute of Graduate Biomedical Sciences, NYU School of Medicine, New York, NY 10016

Christian L. Müller (cmueller@simonsfoundation.org)
Richard A. Bonneau (rbonneau@simonsfoundation.org)
Simons Center for Data Analysis, Simons Foundation, New York, NY 10011

Abstract

We propose a novel method for distance metric learning and generalized Aitchison embedding of multi-class data on the $(p-1)$-simplex. We consider the $p > n$ setting, where $p$ is the number of variables and $n$ the number of data points. This problem setup is motivated by common learning tasks on data sets arising in microbial ecology, where the relative abundance of $p$ species is measured across $n$ environments or patients. Our approach can specifically handle data that contain a large number of zero measurements (zero inflation), a common property of data acquired from targeted high-throughput and single-cell sequencing.

We are given $n$ data pairs $(x_k, y_k)$, where the $x_k$ are $p$-dimensional compositional data vectors restricted to the simplex $\mathcal{S}_+^{p-1} = \{x = (x_1, \ldots, x_p) : x_i \geq 0,\ \sum_{i=1}^{p} x_i = 1\}$, and each vector $x_k$ is associated with a class label $y_k \in \{1, \ldots, K\}$. We seek to learn an embedding of the compositional data in Euclidean space under the constraint that distances between $x_i$ and $x_j$ are small when $y_i = y_j$ and large when $y_i \neq y_j$, thereby improving multi-class learning with $k$-Nearest Neighbor classification (Weinberger & Saul, 2009).

We propose to use Generalized Aitchison Embeddings (GAEs) (Le & Cuturi, 2013) of the form

$$a(x) \equiv P \log(x + b) \in \mathbb{R}^m,$$

where $P \in \mathbb{R}^{m \times p}$, $m \leq p$, is a projection matrix and $b \in \mathbb{R}_+^{p}$ is a positive vector of "pseudocounts". Le & Cuturi (2013; 2014) cast the task of learning the matrix $P$ and the vector $b$ as a metric learning problem for the Euclidean distance between two compositions under an Aitchison embedding:

$$d_a(x_i, x_j) = d_E(a(x_i), a(x_j)) = \| P \log(x_i + b) - P \log(x_j + b) \|_2.$$
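To make the embedding concrete, the following minimal Python sketch evaluates $a(x)$ and $d_a$ for a fixed, arbitrary choice of $P$ and $b$; in the method itself these quantities are learned, and the function names and toy values here are ours, not from the cited papers.

import numpy as np

def gae_embed(x, P, b):
    # Generalized Aitchison embedding a(x) = P log(x + b).
    # x: (p,) composition; P: (m, p) projection; b: (p,) positive pseudocounts.
    return P @ np.log(x + b)

def gae_distance(xi, xj, P, b):
    # Euclidean distance between two compositions under the embedding.
    return np.linalg.norm(gae_embed(xi, P, b) - gae_embed(xj, P, b))

# Toy usage with placeholder values (P and b would normally be learned).
rng = np.random.default_rng(0)
p, m = 10, 3
xi, xj = rng.dirichlet(np.ones(p)), rng.dirichlet(np.ones(p))  # points on the simplex
P = rng.standard_normal((m, p))
b = np.full(p, 0.5)
print(gae_distance(xi, xj, P, b))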

To (i) learn embeddings for high-dimensional compositions when $p > n$ and (ii) adapt the vector of pseudocounts to the underlying sparsity of individual samples $x_k$, we adopt a novel regularization framework for distance metric learning with the following (squared) Mahalanobis distance (Mateu-Figueras et al., 2013):

$$d_M^2(x_i, x_j) = (a(x_i) - a(x_j))^T \Sigma^{-1} (a(x_i) - a(x_j)) = \Delta_{ij}^T Q \Delta_{ij},$$

where $\Sigma$ denotes the covariance matrix of the Aitchison-embedded variables, $\Delta_{ij} = \log(x_i + b_i) - \log(x_j + b_j)$, and $Q = P^T \Lambda P$ is the Mahalanobis metric of the Aitchison embedding, with $\Lambda$ the covariance of the log-transformed data.
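In code, this distance is a direct transcription of the formula above; a minimal sketch, assuming the metric $Q$ and the sample-specific pseudocount vectors $b_i$ and $b_j$ are given (the function name is ours):

import numpy as np

def mahalanobis_sq(xi, xj, bi, bj, Q):
    # Squared Mahalanobis distance Delta^T Q Delta with
    # Delta = log(xi + bi) - log(xj + bj); Q must be positive definite.
    delta = np.log(xi + bi) - np.log(xj + bj)
    return float(delta @ Q @ delta)

Note that each sample carries its own pseudocount vector, which is what lets the formulation adapt to per-sample patterns of zeros.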

Let $X = (x_1, \ldots, x_n) \in \mathbb{R}^{p \times n}$ be the given data and $B = (b_1, \ldots, b_n) \in \mathbb{R}_+^{p \times n}$ the matrix of unknown pseudocounts. Our novel regularized distance metric learning approach is based on the following non-convex optimization problem:

$$\min_{Q \succ 0,\; B \in \mathbb{R}_+^{p \times n}} \; -\log \det Q + \operatorname{tr}(S^T Q) \quad \text{s.t.} \quad \sum_{(x_i, x_j) \in \mathcal{D}} d_M(x_i, x_j) > 1, \quad \|Q\|_1 \leq s, \quad \|B\|_* \leq r,$$

where $S = \log(X + B) \log(X + B)^T$ and $\mathcal{D}$ is the set of dissimilar pairs. The $\ell_1$ constraint on the positive definite matrix $Q$ promotes sparsity, a common structural assumption in the $p > n$ regime (Friedman et al., 2008). Similarly, the nuclear-norm constraint $\|B\|_* = \operatorname{tr}\sqrt{B B^T} \leq r$ promotes low-rank structure in the pseudocount matrix, implying that a low-dimensional subspace can accurately capture the variability of the sample sparsity. The positive scalars $s$ and $r$ are tuning parameters that require data-driven selection.
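The objective itself is cheap to evaluate; here is a sketch under the definitions above (ours, not the authors' solver), using a Cholesky factorization as one standard, numerically stable route to the log-determinant:

import numpy as np

def objective(Q, B, X):
    # -log det Q + tr(S^T Q), with S = log(X + B) log(X + B)^T.
    # X, B: (p, n) data and pseudocount matrices; Q: (p, p) positive definite.
    L = np.log(X + B)
    S = L @ L.T
    logdet = 2.0 * np.sum(np.log(np.diag(np.linalg.cholesky(Q))))
    return -logdet + np.trace(S.T @ Q)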

To efficiently solve the proposed optimization problem to local optimality, we develop a novel iterative Split-Bregman-type algorithm. We demonstrate the validity and superior performance of our embedding method on a variety of synthetic benchmarks as well as on real-world microbial abundance data across different habitats.
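The abstract does not spell out the algorithm, so as background only, here is a sketch of the kind of building blocks a constrained scheme of this form typically needs: Euclidean projections onto the two constraint sets (the standard sort-and-threshold $\ell_1$-ball projection, and the nuclear-norm ball via projecting the singular values). This is an illustrative assumption, not the authors' implementation, and it ignores the additional entrywise nonnegativity of $B$.

import numpy as np

def project_l1_ball(v, s):
    # Euclidean projection of v onto {u : ||u||_1 <= s} (sort-and-threshold).
    u = np.abs(v)
    if u.sum() <= s:
        return v.copy()
    w = np.sort(u)[::-1]
    css = np.cumsum(w)
    rho = np.nonzero(w * np.arange(1, len(w) + 1) > css - s)[0][-1]
    theta = (css[rho] - s) / (rho + 1.0)
    return np.sign(v) * np.maximum(u - theta, 0.0)

def project_nuclear_ball(B, r):
    # Projection onto {B : ||B||_* <= r}: project the singular values
    # onto the l1 ball of radius r and reassemble.
    U, sv, Vt = np.linalg.svd(B, full_matrices=False)
    return (U * project_l1_ball(sv, r)) @ Vt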


References

Friedman, Jerome, Hastie, Trevor, and Tibshirani, Robert. Sparse inverse covariance estimation with the graphical lasso. Biostatistics, 9(3):432–441, 2008.

Le, Tam and Cuturi, Marco. Generalized Aitchison embeddings for histograms. In Asian Conference on Machine Learning, pp. 293–308, 2013. URL http://jmlr.org/proceedings/papers/v29/Le13.html.

Le, Tam and Cuturi, Marco. Adaptive Euclidean maps for histograms: generalized Aitchison embeddings. Machine Learning, 2014. doi: 10.1007/s10994-014-5446-z. URL http://link.springer.com/10.1007/s10994-014-5446-z.

Mateu-Figueras, G., Pawlowsky-Glahn, Vera, and Egozcue, Juan José. The normal distribution in some constrained sample spaces. SORT, 37(1):29–56, 2013. URL http://arxiv.org/abs/0802.2643.

Weinberger, K. Q. and Saul, L. K. Distance metric learning for large margin nearest neighbor classification. Journal of Machine Learning Research, 10:207–244, 2009. URL http://machinelearning.wustl.edu/mlpapers/paper_files/NIPS2005_265.pdf.
