Knowledge-Based Systems 85 (2015) 234–244


Compressed knowledge transfer via factorization machine for heterogeneous collaborative recommendation

Weike Pan (a), Zhuode Liu (a), Zhong Ming (a, corresponding author), Hao Zhong (b), Xin Wang (b), Congfu Xu (b)

(a) College of Computer Science and Software Engineering, Shenzhen University, China
(b) Institute of Artificial Intelligence, College of Computer Science, Zhejiang University, China
E-mail addresses: [email protected] (W. Pan), [email protected] (Z. Liu), [email protected] (Z. Ming), [email protected] (H. Zhong), [email protected] (X. Wang), [email protected] (C. Xu)

Article history: Received 11 August 2014; Received in revised form 10 February 2015; Accepted 8 May 2015; Available online 15 May 2015.

Keywords: Collaborative recommendation; Heterogeneous feedbacks; Factorization machine; Compressed knowledge; Transfer learning

Abstract

Collaborative recommendation has attracted much research attention in recent years. However, an important problem setting, i.e., "a user examined several items but only rated a few", has not received much attention yet. We coin this problem heterogeneous collaborative recommendation (HCR) from the perspective of users' heterogeneous feedbacks of implicit examinations and explicit ratings. In order to fully exploit the two types of feedbacks, we propose a novel and generic solution called compressed knowledge transfer via factorization machine (CKT-FM). Specifically, we assume that the compressed knowledge of user homophily and item correlation, i.e., the user groups and item sets behind the two types of feedbacks, is similar, and we then design a two-step transfer learning solution consisting of compressed knowledge mining and compressed knowledge integration. Our solution is able to transfer high quality knowledge via noise reduction, to model rich pairwise interactions among individual-level and cluster-level entities, and to adapt potentially inconsistent knowledge from implicit feedbacks to explicit feedbacks. Furthermore, an analysis of time and space complexity shows that our solution is much more efficient than the state-of-the-art method for heterogeneous feedbacks. Extensive empirical studies on two large data sets show that our solution is significantly better than the state-of-the-art non-transfer learning method w.r.t. recommendation accuracy, and is much more efficient than leveraging the raw implicit examinations directly w.r.t. CPU time and memory usage. Hence, our CKT-FM strikes a good balance between effectiveness and efficiency of knowledge transfer in HCR.

© 2015 Elsevier B.V. All rights reserved.

1. Introduction

Recommendation functionality has been widely implemented as a default module in various Internet services such as YouTube's video recommendation and Amazon's book recommendation. Factorization based collaborative recommendation algorithms with low-rank assumptions have dominated various recommendation scenarios due to their applicability and high accuracy. Most factorization based methods focus on homogeneous user feedbacks, e.g., explicit ratings in matrix factorization [28,31] and implicit feedbacks in Bayesian personalized ranking (BPR) [27].

However, few works have studied a very common problem setting, in which "a user examined several items but only rated a few". This setting is called heterogeneous collaborative recommendation (HCR) and considers different types of users' feedbacks,

including implicit examinations (e.g., browsing and clicks) and explicit ratings. In a typical recommendation system, implicit feedbacks are usually more abundant and thus have the potential to help alleviate the sparsity problem of users' explicit ratings.

For the HCR problem, the most well-known method is probably the SVD++ model [11], which can be mimicked by factorization machine (FM) [24]. SVD++ and FM combine the two types of feedbacks in a principled way via changing the prediction rule defined on one (user, item, rating) triple in explicit feedbacks to one defined on both the triple and all items examined by the user. Because implicit feedbacks are usually much more abundant than explicit feedbacks, leveraging raw implicit feedbacks increases the time and space cost significantly, which may make this approach inapplicable in some real-world recommendation scenarios. The increase of time and space cost is also observed in our empirical studies in Section 4.

In order to leverage the implicit feedbacks in a more efficient and effective way, we address the HCR problem from a novel transfer learning perspective [19], in which we take explicit feedbacks as target data and implicit feedbacks as auxiliary data. Technically, we propose a novel two-step transfer learning


solution, i.e., compressed knowledge transfer via factorization machine (CKT-FM), for knowledge sharing between auxiliary data and target data. In our first step, we mine compressed knowledge of user homophily (i.e., user groups) and item correlation (i.e., item sets) from the auxiliary implicit feedbacks, which is expected to be more parsimonious than the raw implicit feedbacks. In our second step, we design an integrative knowledge transfer solution via expanding the design matrix of factorization machine, which incorporates the compressed knowledge of user groups and item sets into the target data in a seamless manner. We then conduct extensive empirical studies on two large data sets and obtain significantly better results via our CKT-FM than the state-of-the-art method without knowledge transfer. Furthermore, our CKT-FM is much more efficient than the method leveraging raw implicit examinations w.r.t. CPU time and memory usage.

We summarize our main contributions as follows: (i) we propose a novel and generic compressed knowledge transfer solution via factorization machine (CKT-FM) for heterogeneous collaborative recommendation; and (ii) we conduct extensive empirical studies on two large data sets and show that our CKT-FM is significantly better than the state-of-the-art non-transfer learning method w.r.t. recommendation accuracy, and is much more efficient than leveraging the raw implicit examinations directly w.r.t. CPU time and memory usage.

We organize the paper as follows. First, we provide some background information, including a formal definition of the studied problem and a description of factorization machine, in Section 2. Second, we describe our two-step knowledge transfer solution in detail in Section 3. Third, we conduct extensive empirical studies and detailed analysis in Section 4. Fourth, we discuss some existing works on closely related topics in Section 5. Finally, we conclude the paper with some future directions.

2. Background

2.1. Problem definition

In our studied heterogeneous collaborative recommendation (HCR) problem, we have n users and m items in the target data, for which we have observed some explicit feedbacks of ratings, e.g., r_ui for user u's graded preference on item i. Besides the target explicit feedbacks, we also have some auxiliary data of implicit examination records such as users' actions of browsing and clicks. We use R = [r_ui]_{n×m} and E = [e_ui]_{n×m} to denote the explicit ratings and the implicit examinations, respectively. Our goal is then to design an effective and efficient solution to transfer knowledge from the auxiliary implicit feedbacks E to the target explicit feedbacks R, in order to address the sparsity problem of graded preferences in the target data. Note that the users and items are the same in both the target data and the auxiliary data, so our setting can be categorized as a frontal-side transfer learning setting [23], rather than a two-side [22], user-side [7] or item-side [29] knowledge transfer setting. We illustrate the studied problem in Fig. 1, in particular the left part (implicit examinations) and the right part (explicit ratings).

Fig. 1. Illustration of compressed knowledge transfer (CKT) via factorization machine (FM) for heterogeneous collaborative recommendation (HCR).

We put some commonly used notations in Table 1, including (i) feedbacks, (ii) latent variables, (iii) compressed knowledge, (iv) pairwise interactions, and (v) variables in factorization machine. Please refer to this table for the descriptions of the notations used in this paper.

Table 1. Some notations used in the paper.

(i) Feedbacks
  n, m                                  Number of users and items
  ℛ                                     Rating range, e.g., {0.5, 1, ..., 5}
  R ∈ {ℛ ∪ ?}^{n×m}                     Numerical rating matrix
  E ∈ {1, ?}^{n×m}                      Unary examination matrix of implicit feedbacks
  Eb ∈ {1, 0}^{n×m}                     Binary examination matrix converted from E
  u                                     User ID
  i, i′                                 Item ID
  r_ui ∈ ℛ ∪ ?, e_ui ∈ {1, ?}           Rating and examination of user u on item i
  p, pe                                 Number of ratings and examinations

(ii) Latent variables
  d                                     Number of latent dimensions in SVD
  U ∈ ℝ^{n×d}                           Users' latent preferences
  V ∈ ℝ^{m×d}                           Items' latent features

(iii) Compressed knowledge
  g, s                                  Number of user groups and item sets
  G ∈ {0,1}^{n×g}, S ∈ {0,1}^{m×s}      Membership matrices of users and items
  G_u, S_i                              User group of user u and item set of item i

(iv) Pairwise interactions
  (u × i)                               Individual-level interaction between u and i
  (G_u × S_i)                           Cluster-level interaction between G_u and S_i
  (G_u × i), (u × S_i)                  Hybrid interactions between G_u and i, and between u and S_i
  (G_u × u), (S_i × i)                  Adaptation from G_u to u, and from S_i to i

(v) Variables in factorization machine
  X ∈ {0,1}^{p×(n+m)}                   Design or feature matrix
  x_ui ∈ {0,1}^{1×(n+m)}                Design or feature vector of triple (u, i, r_ui)
  r ∈ ℝ^{p×1}                           Rating vector
  X̃ ∈ {0,1}^{p×(n+m+g+s)}              Expanded design or feature matrix
  f                                     Number of latent dimensions of FM

2.2. Factorization machine

The main idea of factorization machine (FM) [24] is to represent the (user, item) rating matrix R in a new form, including a design matrix X and a target vector r,

FM(R) → FM(X, r).

Specifically, X and r are associated with p feature vectors and p ratings, respectively, i.e., X = [x_ui]_{p×1} ∈ {0,1}^{p×(n+m)} and


r = [r_ui]_{p×1} ∈ ℝ^{p×1}, where p is the number of observed explicit ratings in R. For a typical (user, item, rating) triple, i.e., (u, i, r_ui), it is represented as (x_ui, r_ui), where x_ui ∈ {0,1}^{1×(n+m)} is a feature vector with the u-th and (n+i)-th entries being 1 and all other entries being 0. Such a representation of x_ui is usually called dummy coding. The rating r_ui is then put in the corresponding location of the target vector r.

With this new representation, FM [24] models pairwise interactions for every two non-zero features of each feature vector x_ui via two latent vectors, one latent vector per feature. The most commonly used formulation of FM is the second-order pairwise interaction with factorized variables, which inherits the advantages of support vector machines (SVM) [4] and matrix factorization (MF) [28]. Specifically, the rating of user u on item i in FM is approximated as follows [24],

r̂_ui = w^(0) + ∑_{j=1}^{n+m} w^(j) x_ui(j) + ∑_{j=1}^{n+m} ∑_{j′=j+1}^{n+m} x_ui(j) x_ui(j′) v^(j) v^(j′)ᵀ,

where the scalars w^(j) ∈ ℝ, j = 0, 1, ..., n+m, and the vectors v^(j) ∈ ℝ^{1×f}, j = 1, 2, ..., n+m, are model parameters to be learned. Once the model parameters have been learned, we can estimate each user's preference on each item, which can then be used for personalized recommendation. Note that the above formula can be reformulated so that it is computable in linear time [24].

One of the brightest aspects of FM is its high flexibility to integrate various auxiliary data via expanding the design matrix with additional columns, such as temporal information, user demographics, item descriptions and auxiliary feedbacks [16,24]. However, a major limitation also arises from such straightforward expansions, namely low efficiency, because the raw auxiliary data are usually much larger than the target data. That is also our motivation for designing a compressed knowledge transfer solution when leveraging knowledge from auxiliary implicit feedbacks via FM. With compressed knowledge transfer, we expect to obtain more accurate predictions than FM on explicit ratings only, and to achieve more efficient knowledge transfer than FM with raw auxiliary data regarding time and space complexity. Hence, we expect to have a well-balanced solution between effectiveness and efficiency in exploiting heterogeneous feedbacks.
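To make the prediction rule and its linear-time reformulation concrete, here is a minimal Python sketch (our illustration, not the authors' code; the function name fm_predict and the toy sizes are our assumptions). For a dummy-coded (u, i) pair, the pairwise term collapses to the inner product of the two active latent vectors, recovering plain matrix factorization.

```python
import numpy as np

def fm_predict(x, w0, w, V):
    """Second-order FM prediction for one feature vector x of length n+m.
    w0: global bias; w: per-feature weights, shape (n+m,); V: latent vectors,
    shape (n+m, f). Uses the linear-time reformulation of the pairwise term:
    0.5 * sum_f [ (sum_j V[j,f] x_j)^2 - sum_j (V[j,f] x_j)^2 ]."""
    s = V.T @ x                  # shape (f,)
    s2 = (V ** 2).T @ (x ** 2)   # shape (f,)
    return w0 + w @ x + 0.5 * np.sum(s ** 2 - s2)

# Dummy-coded example (0-indexed here): entries u and n+i of x are 1, rest 0.
n, m, f = 5, 4, 3
rng = np.random.default_rng(0)
w0, w, V = 0.1, rng.normal(size=n + m), rng.normal(size=(n + m, f))
x = np.zeros(n + m)
u, i = 2, 1
x[u] = 1.0
x[n + i] = 1.0
print(fm_predict(x, w0, w, V))  # equals w0 + w[u] + w[n+i] + V[u] @ V[n+i]
```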

3. Compressed knowledge transfer via factorization machine

Our proposed compressed knowledge transfer (CKT) solution contains two major steps: compressed knowledge mining and compressed knowledge integration. Specifically, in the first step, we aim to reduce the noise effect of implicit feedbacks and then mine some compressed knowledge of user groups and item sets; in the second step, we propose to transfer the mined compressed knowledge via feature expansion of factorization machine. We describe these two steps in detail in the following two subsections.

3.1. Compressed knowledge mining

For convenience of notation, we replace the missing values in E with 0s, and thus have a full binary matrix Eb ∈ {1, 0}^{n×m}. Note that we do not need to store the full binary matrix in memory, since we can represent it in a parsimonious way via recording the 1s only. For the auxiliary implicit feedbacks, we first adopt singular value decomposition (SVD; e.g., MATLAB's svds, http://www.mathworks.com/help/matlab/ref/svds.html) to learn users' latent preferences and items' latent features,

[U′, B, V′] ← SVD(Eb, d),   (1)

where d means that we only keep the d largest singular values and their corresponding singular vectors. Note that SVD has the effect of noise reduction [22], which is helpful for knowledge mining from the uncertain implicit feedbacks. With the factorized variables, we use U = U′B^{1/2} ∈ ℝ^{n×d} and V = V′B^{1/2} ∈ ℝ^{m×d} to denote the users' latent preferences and items' latent features, respectively.

In HCR, the semantic meanings of auxiliary implicit examinations and target explicit ratings are very different, i.e., an examination record (u, i) and a rating record (u, i, r_ui) represent different levels of uncertainty about the user's preferences. However, the user homophily, such as the user groups, in the two data sets is usually similar, because two users with similar browsing behaviors in the auxiliary data are likely to have similar tastes in the target data. Similarly, the item correlations behind the two types of feedbacks are also likely to be similar. With this assumption, we apply k-means clustering (e.g., MATLAB's kmeans, http://www.mathworks.com/help/stats/kmeans.html) to the users' latent preferences U = U′B^{1/2} and the items' latent features V = V′B^{1/2} in order to mine the user groups and item sets, respectively. We represent the mining process as follows,

G ← k-means(U, g),   S ← k-means(V, s),   (2)

where g denotes the number of user groups, s denotes the number of item sets, and G ∈ {0,1}^{n×g} with G 1_{g×1} = 1_{n×1} and S ∈ {0,1}^{m×s} with S 1_{s×1} = 1_{m×1} are the membership matrices for users and items, respectively. The homophily and correlation among users and items are thus encoded in G and S, because two users' (or two items') membership vectors will be the same if they belong to the same group (or set).

Compared with the raw implicit feedbacks, the knowledge of user groups and item sets in G and S is much compressed, because the numbers of groups and sets are usually much smaller than the numbers of users and items, respectively, i.e., g ≪ n and s ≪ m. For this reason, we call the first step of our solution compressed knowledge mining, which is illustrated in the middle part of Fig. 1 (denoted as "Compressed knowledge").
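The paper performs this step in MATLAB (svds and kmeans, see the URLs above). As a rough sketch of the same mining step in Python, where the helper name, the SciPy/scikit-learn substitution and the default values are our assumptions rather than the authors' implementation:

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds
from sklearn.cluster import KMeans

def mine_compressed_knowledge(rows, cols, n, m, d=100, g=200, s=200, seed=0):
    """Step 1 of CKT-FM as described above: truncated SVD of the sparse binary
    examination matrix Eb for denoising, then k-means on the scaled singular
    vectors to obtain a group id per user and a set id per item."""
    Eb = csr_matrix((np.ones(len(rows)), (rows, cols)), shape=(n, m))
    U0, b, V0t = svds(Eb, k=d)   # the d largest singular triplets (Eq. (1))
    B_sqrt = np.sqrt(b)          # component order does not matter for scaling
    U = U0 * B_sqrt              # users' latent preferences, U = U' B^{1/2}
    V = V0t.T * B_sqrt           # items' latent features,   V = V' B^{1/2}
    Gu = KMeans(n_clusters=g, random_state=seed).fit_predict(U)   # Eq. (2)
    Si = KMeans(n_clusters=s, random_state=seed).fit_predict(V)
    return Gu, Si
```

Returning cluster ids rather than explicit 0/1 membership matrices G and S is equivalent here, since each user or item belongs to exactly one group or set.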

3.2. Compressed knowledge integration

In a typical factorization model [28], we usually focus on modeling the pairwise interaction between a user and an item when the corresponding (user, item, rating) triple (u, i, r_ui) is observed. However, when the explicit ratings are sparse, such an individual-level interaction between a user u and an item i, denoted as (u × i), may not be reliable for characterizing users' preferences. As a response, we propose to integrate and transfer the mined compressed knowledge, i.e., the user group G_u of user u and the item set S_i of item i, into the target task of learning user u's preference on item i via factorization machine [4]. Specifically, we design three new types of interactions, inspired by the feature engineering process of factorization machine [4] in the context of HCR: (i) cluster-level interactions between user groups and item sets, (ii) hybrid interactions between user groups and items (or between users and item sets), and (iii) preference adaptation from user groups in the auxiliary data to users in the target data (or feature adaptation from item sets to items). First, we describe the cluster-level interaction and the hybrid interaction.

- Cluster-level interaction (G_u × S_i): Cluster-level interaction, defined on user groups and item sets, is a smoothed version of individual-level interaction, which may help alleviate the sparsity problem to some extent. Specifically, (G_u × S_i) aims to approximate the rating r_ui via modeling the interaction between a user group and an item set mined from the auxiliary implicit feedbacks. Note that the cluster-level rating pattern in [6,12,13,18] is a g by s non-negative matrix, which is thus different from our membership matrices, i.e., G ∈ {0,1}^{n×g} and S ∈ {0,1}^{m×s}. As compared with the codebook in [6,12,13,18], the user groups G and item sets S in our solution are expected to transfer more knowledge and to model richer cluster-level interactions.

- Hybrid interactions (G_u × i) and (u × S_i): A hybrid interaction is a mix of individual-level and cluster-level interaction. Specifically, (G_u × i) captures the preference of the group that user u belongs to on item i, and (u × S_i) captures user u's overall preference on the set that item i belongs to. Interactions between a user group and an item have been explored for recommendation with implicit feedbacks only, e.g., group preference based Bayesian personalized ranking (GBPR) [20]. However, the user groups in GBPR [20] are not mined from auxiliary data and fixed, but are randomly constructed during the learning procedure, so GBPR cannot model real hybrid interactions since there is no membership matrix. Furthermore, the problem setting studied in GBPR [20] is different from our HCR in Fig. 1. Similar to the aforementioned cluster-level interaction, the hybrid interactions can also be considered as a smoothing approach and thus may help alleviate the sparsity problem.

So far, we have assumed that the compressed knowledge mined from the auxiliary implicit examinations is the same as that from the target explicit ratings. However, there may still be some inconsistency between the hidden preferences of users in the two data sets. In order to mitigate such inconsistency, an adaptation step is usually adopted in knowledge transfer methods [19,22]. Hence, we further propose two novel pairwise interactions, i.e., (G_u × u) and (S_i × i), in order to adapt the user preferences and item features from the auxiliary data to the target data in a principled way. Specifically, (G_u × u) models the consistency between the group G_u's preference as reflected in the auxiliary examination records and user u's preference in the target explicit ratings. Similarly, (S_i × i) models the consistency between the latent features of item set S_i in the auxiliary data and those of item i in the target data. For example, a strong interaction between group G_u and user u means that the group's preference and the user's preference are consistent, or that the user homophily is similar in the two data sets.

Finally, we have two families of interactions, one for preference modeling and one for preference or feature adaptation,

Preference: (u × i), (G_u × S_i), (G_u × i), (u × S_i);
Adaptation: (G_u × u), (S_i × i).

It is interesting to see that the above six interactions are actually all possible pairwise interactions among the four entities u, i, G_u and S_i, as shown in Fig. 2.

Fig. 2. Illustration of the six pairwise interactions with mined compressed knowledge of a user group and an item set. The solid line is for individual-level interaction, the dashed lines are for cluster-level interaction and hybrid interaction, and the arrows are for preference or feature adaptation.

Note that when G_u and S_i are hard membership vectors, i.e., one user (or item) belongs to one and only one group (or set), all these interactions can be modeled exactly via FM [24] by expanding the design matrix,

X ∈ {0,1}^{p×(n+m)} → X̃ ∈ {0,1}^{p×(n+m+g+s)},   (3)

where g and s are the numbers of user groups and item sets, respectively, and each original design or feature vector x_ui in X is extended to x̃_ui ∈ {0,1}^{1×(n+m+g+s)}. Note that we do not introduce normalization on the appended compressed knowledge, as is usually adopted by SVD++ [11], because each user (or item) belongs to only one group (or set). We may also transfer the compressed knowledge of only user groups or only item sets rather than both, which then results in a shorter feature vector, i.e., x̃_ui ∈ {0,1}^{1×(n+m+g)} for user groups and x̃_ui ∈ {0,1}^{1×(n+m+s)} for item sets. We study the empirical performance of transferring the compressed knowledge of user groups, item sets and both in Section 4.

With the expanded design matrix X̃ that integrates the compressed knowledge G and S via the interactions introduced above for preference modeling and adaptation, we deploy the available implementation of factorization machine [24] (i.e., the libFM software, http://www.libfm.org/) for further learning and prediction,

FM(R, G, S) → FM(X̃, r).   (4)
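As an illustration of Eq. (3) (a hedged sketch; expand_design_matrix is our hypothetical helper, and memberships are encoded as cluster ids as in the previous sketch), the expanded design matrix can be built with four 1-entries per rating:

```python
import numpy as np
from scipy.sparse import csr_matrix

def expand_design_matrix(triples, n, m, Gu, Si):
    """Dummy-code each (user, item, rating) triple into a row of X~ (Eq. (3))
    with 1s at columns u, n+i, n+m+Gu[u] and n+m+g+Si[i]; FM then models all
    six pairwise interactions among {u, i, G_u, S_i} of Fig. 2 automatically."""
    g, s = Gu.max() + 1, Si.max() + 1
    p = len(triples)
    rows, cols = [], []
    r = np.empty(p)
    for row, (u, i, rui) in enumerate(triples):
        cols += [u, n + i, n + m + Gu[u], n + m + g + Si[i]]
        rows += [row] * 4
        r[row] = rui
    X_tilde = csr_matrix((np.ones(4 * p), (rows, cols)), shape=(p, n + m + g + s))
    return X_tilde, r
```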

3.3. Algorithm

We depict the above two major steps of compressed knowledge mining and compressed knowledge integration in Fig. 3; together they comprise four specific components: denoising, clustering, incorporation and factorization. Specifically, we first apply singular value decomposition to the converted binary examination matrix so as to denoise the raw examination records and extract latent variables, which are then used by k-means clustering to mine the compressed knowledge of user groups and item sets. After that, we transfer the mined compressed knowledge via integrating it into the target design matrix. Finally, we factorize the expanded design matrix via factorization machine. From Fig. 3, we can also see that our CKT-FM is quite general and flexible, because we may derive a new solution by plugging an alternative algorithm into any component; for example, we may use a different clustering algorithm for the clustering component.

As for time complexity, our CKT-FM is much more efficient than FM with raw implicit feedbacks, because the design matrix in CKT-FM, i.e., {0,1}^{p×(n+m+g+s)}, is much smaller than that in FM with raw implicit feedbacks, i.e., {0,1}^{p×(n+m+m)}, which mimics SVD++ [11]. Furthermore, the numbers of pairwise interactions and model parameters of CKT-FM are also much smaller than those of FM and SVD++. Our empirical results on CPU time and memory usage confirm this analysis. Note that the denoising step via SVD is rather efficient for our implicit feedback matrix, in which most entries are 0. The k-means clustering algorithm is also very efficient; we find that it converges well within 300 iterations.

Fig. 3. The algorithm of CKT-FM (compressed knowledge transfer via factorization machine).
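To tie the four components together, a toy end-to-end run might look as follows; it reuses the two hypothetical helpers sketched in Sections 3.1 and 3.2 and stops where libFM would take over.

```python
import numpy as np

# Toy implicit examinations (user, item) and target explicit ratings.
n, m = 6, 5
rows = np.array([0, 0, 1, 2, 3, 4, 5, 5])   # users with examination records
cols = np.array([0, 1, 1, 2, 2, 3, 3, 4])   # the items they examined
train_triples = [(0, 1, 4.0), (2, 2, 3.5), (5, 4, 5.0)]

Gu, Si = mine_compressed_knowledge(rows, cols, n, m, d=2, g=2, s=2)  # denoise + cluster
X_tilde, r = expand_design_matrix(train_triples, n, m, Gu, Si)       # incorporate
# Factorize: hand (X_tilde, r) to an FM learner; the paper dumps the design
# matrix in libFM's sparse input format and runs libFM with the MCMC solver.
```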

4. Experimental results

4.1. Data sets and evaluation metric

MovieLens10M. MovieLens10M (http://grouplens.org/datasets/movielens/) is a public recommendation data set with n = 71,567 users, m = 10,681 items, and 10,000,000 ratings in {0.5, 1, ..., 5}. As far as we know, there are no publicly available data sets including both explicit feedbacks and implicit feedbacks. In order to simulate the problem setting shown in Fig. 1, we follow previous works [15,30] and preprocess the data as follows. First, we randomly split the (user, item, rating) triples into five sets of equal size. Second, we take one set as target explicit feedbacks for test, two sets as target explicit feedbacks for training, and the remaining two sets as auxiliary data for training. Third, we adopt a common approach [15,30] to convert all (user, item, rating) triples in the auxiliary data to (user, item) pairs as implicit feedbacks via removing the rating values. Fourth, we randomly take 5n, 10n and 15n ratings from the target explicit feedbacks, so that every user has 5, 10 and 15 ratings on average, and use these data to study the effectiveness of sparsity reduction of the proposed compressed knowledge transfer solution. We then repeat the second, third and fourth steps five times and obtain five copies of data in order to conduct 5-fold empirical studies.

Flixter. Flixter (http://www.cs.sfu.ca/~sja25/personal/datasets/) [9] contains n = 147,612 users, m = 48,794 items and 8,196,077 ratings in {0.5, 1, ..., 5}. We preprocess this data set in the same way as MovieLens10M.

In order to gain a deeper understanding of the effectiveness of the proposed knowledge transfer solution, we also calculate the ratio of the number of auxiliary examination records to the number of target explicit ratings, i.e., pe/p, as shown in the last column of Table 2. We can see that the ratios of the two data sets are quite different, which is also reflected in the prediction performance in Tables 4 and 5. A formal description of the data is shown in Table 2.

Table 2. Description of MovieLens10M (n = 71,567; m = 10,681) and Flixter (n = 147,612; m = 48,794) used in the experiments.

  Data set                   Record number                Ratio (pe/p)
  MovieLens10M
    Explicit (training)      p = {5, 10, 15} × 71,567     11.2; 5.6; 3.7
    Implicit (training)      pe = 4,000,022
    Explicit (test)          2,000,010
  Flixter
    Explicit (training)      p = {5, 10, 15} × 147,612    4.4; 2.2; 1.5
    Implicit (training)      pe = 3,278,431
    Explicit (test)          1,639,215

Evaluation metric. For quantitative evaluation of the effectiveness of the compressed knowledge transfer solution for rating prediction, we adopt a commonly used evaluation metric, the Root Mean Square Error (RMSE),

RMSE = √( ∑_{(u,i,r_ui) ∈ T_E} (r_ui − r̂_ui)² / |T_E| ),

where r_ui and r̂_ui are the true and predicted ratings, respectively, and |T_E| is the number of test ratings.
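For completeness, the metric in code (a trivial sketch assuming NumPy arrays of aligned true and predicted ratings):

```python
import numpy as np

def rmse(r_true, r_pred):
    """Root Mean Square Error over the test set T_E, matching the formula above."""
    r_true, r_pred = np.asarray(r_true, float), np.asarray(r_pred, float)
    return np.sqrt(np.mean((r_true - r_pred) ** 2))

print(rmse([4.0, 3.5, 5.0], [3.8, 3.9, 4.6]))  # ≈ 0.3464
```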

4.2. Baselines and parameter settings

Our proposed knowledge transfer solution is a generic framework with four specific components: denoising, clustering, incorporation and factorization via factorization machine [24]. In order to study the effect of the proposed knowledge transfer solution more directly, we choose factorization machine mimicking SVD++ [11] as our major baseline. Note that factorization machine is a very strong baseline, which has won several international competition awards including KDD CUP 2012 [25] and the ECML PKDD Discovery Challenge 2013 [3].

Due to our major motivation of sparsity reduction for explicit ratings, we also include a smoothing method called user-based average filling (UAF), i.e., r̂_ui = r̄_u = ∑_{j=1}^{m} y_uj r_uj / ∑_{j=1}^{m} y_uj, where y_uj = 1 if the rating r_uj is observed and y_uj = 0 otherwise. Furthermore, we compare CKT-FM with a basic user-based collaborative filtering (UCF) method, i.e., r̂_ui = r̄_u + ∑_{w ∈ N_u^i} s_wu (r_wi − r̄_w) / ∑_{w ∈ N_u^i} s_wu, where s_wu is the Pearson correlation between user u and user w, and N_u^i is the neighboring set of users w.r.t. user u and item i. Note that we use the whole neighboring set due to the sparsity of the explicit feedbacks.
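The two baselines are simple enough to state in code. The sketch below is our reading of the formulas above (dense matrices and a boolean mask are used only for readability; the fallback for users without ratings is our choice, as the paper does not specify one):

```python
import numpy as np

def uaf_predict(R, mask, u):
    """User-based average filling: r-hat_ui = mean of user u's observed ratings."""
    obs = mask[u]
    return R[u, obs].mean() if obs.any() else R[mask].mean()  # global-mean fallback

def ucf_predict(R, mask, u, i):
    """User-based CF: r-hat_ui = r-bar_u + sum_w s_wu (r_wi - r-bar_w) / sum_w s_wu,
    summed over all users w who rated item i (the whole neighboring set, as in
    the paper)."""
    n_users = R.shape[0]
    r_bar = np.array([R[w, mask[w]].mean() if mask[w].any() else 0.0
                      for w in range(n_users)])
    num = den = 0.0
    for w in range(n_users):
        if w == u or not mask[w, i]:
            continue
        common = mask[u] & mask[w]
        if common.sum() < 2:
            continue                      # Pearson needs >= 2 co-rated items
        s_wu = np.corrcoef(R[u, common], R[w, common])[0, 1]
        if np.isnan(s_wu):
            continue                      # zero variance -> undefined correlation
        num += s_wu * (R[w, i] - r_bar[w])
        den += s_wu
    return r_bar[u] + num / den if den != 0 else r_bar[u]
```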


For the algorithms to solve the optimization problem in factorization machine [24], we choose MCMC among the SGD, SGDA, ALS and MCMC solvers implemented in the libFM software (http://www.libfm.org/). In our preliminary studies, we find that MCMC usually generates much better results with fewer parameter configurations. Since the focus of our empirical study is the effectiveness of the proposed knowledge transfer solution rather than the design of a new optimization algorithm, we choose the MCMC algorithm to avoid external factors such as tedious parameter configuration.

In order to decide an empirically good value of the initialization parameter σ of MCMC in FM, we construct a validation set via randomly taking 1 rating per user on average from the training data of the target explicit ratings. We then use the remaining training data to train the model with different values of σ ∈ {0.01, 0.05, 0.10, 0.15, 0.20, 0.25}. The value of σ with the best RMSE after 200 iterations on the first copy of each data set is selected and fixed for further empirical studies. We include the prediction performance and the corresponding values of σ in Tables 4 and 5, and report the results with different iteration numbers in {50, 100, 150, 200} in Fig. 4. For the number of latent dimensions in factorization machine, we fix f = 20 [32]. Note that in the first step of our solution, we use d = 100 in the SVD of Eq. (1) in order to find the users' latent preferences and the items' latent features, and use g = s = 200 in the k-means of Eq. (2) with 300 iterations and Euclidean distance so as to mine the user groups and item sets.

Fig. 4. Prediction performance of CKT-FM and FM on MovieLens10M and Flixter with different iteration numbers. We adopt the implementation of FM w/o block structure [24] in the experiments. Note that the number in parentheses denotes the number of ratings per user on average.

For convenience of comparative studies and discussions, we denote user-based average filling (UAF) with explicit ratings as UAF(R), user-based collaborative filtering (UCF) with explicit ratings as UCF(R), factorization machine (FM) with explicit ratings as FM(R), FM with explicit ratings and raw implicit examinations as FM(R, E), compressed knowledge transfer (CKT) with explicit ratings and user groups as CKT-FM(R, G), CKT with explicit ratings and item sets as CKT-FM(R, S), and CKT with explicit ratings, user groups and item sets as CKT-FM(R, G, S).

4.3. Results

We study the empirical performance of CKT-FM mainly with the following three questions. First, is CKT-FM more efficient than FM with raw implicit feedbacks, i.e., FM(R, E)? Second, is CKT-FM more accurate than FM without auxiliary implicit feedbacks, i.e., FM(R)? Third, is the compressed knowledge useful? Specifically, we answer the first question in Section 4.3.1, the second question in Sections 4.3.1-4.3.3, and the third question in Section 4.3.4.

4.3.1. Main results

In our preliminary studies, we find that CKT-FM(R, G, S) is much more efficient than FM(R, E), which verifies our complexity analysis in Section 3.3. In order to study the efficiency issue more precisely, we control the computing environment when measuring CPU time and memory usage. Specifically, we conduct experiments on Windows Server 2008 with an Intel(R) Core(TM) i7-3770 CPU @ 3.40 GHz (1 CPU, 4 cores) and 12 GB RAM, where all other non-system processes are terminated due to the high space complexity of FM(R, E). We record the CPU time and memory usage of running FM(R), FM(R, E) and CKT-FM(R, G, S) on the first copy of MovieLens10M and Flixter, and report the results in Table 3. The reported CPU time and memory usage cover the FM part only, excluding the steps of SVD and k-means, because we focus on the efficiency of compressed knowledge transfer as compared with that of leveraging the raw implicit feedbacks. Another reason for excluding the cost of SVD and k-means is that they are implemented in MATLAB, which may not be comparable with the C++ implementation of FM. Note that the libFM software has two implementations, i.e., (i) FM without block structure [24] and (ii) FM with block structure [26], where the latter exploits the repeating patterns of the design matrix and is thus of lower time and space complexity. From the quantitative results of CPU time

and memory usage in Table 3, we can have the following observations:

- Factorization machine with explicit ratings, i.e., FM(R), is the most efficient, which is consistent with our analysis of the relationship between efficiency and the size of the design matrix and the number of model parameters. The time and space cost of CKT-FM(R, G, S) is only slightly higher than that of FM(R) and much lower than that of FM(R, E), which clearly shows that our compressed knowledge transfer solution is very efficient.

- The CPU time and memory usage of FM(R, E) on Flixter increase as compared with those on MovieLens10M. The main reason is that there are more items in Flixter, as shown in Table 2, and according to the formula of FM (which mimics SVD++ [11]), each item examined by a user is appended to the design matrix, resulting in a much larger number of pairwise interactions and model parameters.

Table 3. CPU time and memory usage of running factorization machine (FM) with explicit ratings, i.e., FM(R), FM with explicit ratings and raw implicit examinations, i.e., FM(R, E), and FM with explicit ratings and compressed knowledge, i.e., CKT-FM(R, G, S), on the first copy of MovieLens10M and Flixter. The number of iterations is fixed as 200. The number of ratings per user on average is 10.

  Data            Algorithm          CPU time (min.)   Memory usage (GB)
  FM w/o block structure [24]
  MovieLens10M    FM(R)              15                0.16
                  FM(R, E)           915               4.1
                  CKT-FM(R, G, S)    24                0.20
  Flixter         FM(R)              20                0.18
                  FM(R, E)           2345              9.7
                  CKT-FM(R, G, S)    34                0.23
  FM w/ block structure [26]
  MovieLens10M    FM(R)              6                 0.18
                  FM(R, E)           17                0.22
                  CKT-FM(R, G, S)    7                 0.18
  Flixter         FM(R)              13                0.21
                  FM(R, E)           24                0.25
                  CKT-FM(R, G, S)    13                0.21

We then conduct extensive empirical studies of UAF(R), UCF(R), FM(R), and the three variants of CKT-FM, i.e., CKT-FM(R, G), CKT-FM(R, S) and CKT-FM(R, G, S). We also include the results of FM(R, E) for reference, although it takes much more time as shown in Table 3. The number of iterations is fixed as 200, and we adopt the implementation of FM w/o block structure [24]. We report the results in Tables 4 and 5, from which we can have the following observations:

Table 4. Prediction performance of CKT-FM and other methods on MovieLens10M. The number in parentheses denotes the number of ratings per user on average. The best results are marked with * (bold in the original article).

  Data                Algorithm          Parameter   RMSE
  MovieLens10M (5)    UAF(R)             -           1.0635 ± 0.0009
                      UCF(R)             -           1.0384 ± 0.0006
                      FM(R)              σ = 0.25    0.8971 ± 0.0008
                      CKT-FM(R, G)       σ = 0.25    0.8927 ± 0.0006
                      CKT-FM(R, S)       σ = 0.20    0.8901 ± 0.0005
                      CKT-FM(R, G, S)    σ = 0.20    0.8868 ± 0.0008
                      FM(R, E)           σ = 0.20    0.8826 ± 0.0006 *
  MovieLens10M (10)   UAF(R)             -           1.0280 ± 0.0007
                      UCF(R)             -           0.9539 ± 0.0005
                      FM(R)              σ = 0.20    0.8707 ± 0.0008
                      CKT-FM(R, G)       σ = 0.20    0.8667 ± 0.0010
                      CKT-FM(R, S)       σ = 0.15    0.8659 ± 0.0007
                      CKT-FM(R, G, S)    σ = 0.15    0.8618 ± 0.0008
                      FM(R, E)           σ = 0.15    0.8564 ± 0.0006 *
  MovieLens10M (15)   UAF(R)             -           1.0111 ± 0.0007
                      UCF(R)             -           0.9233 ± 0.0007
                      FM(R)              σ = 0.20    0.8550 ± 0.0005
                      CKT-FM(R, G)       σ = 0.15    0.8505 ± 0.0007
                      CKT-FM(R, S)       σ = 0.15    0.8503 ± 0.0005
                      CKT-FM(R, G, S)    σ = 0.15    0.8462 ± 0.0007
                      FM(R, E)           σ = 0.15    0.8409 ± 0.0005 *

Table 5. Prediction performance of CKT-FM and other methods on Flixter. The number in parentheses denotes the number of ratings per user on average. The best results are marked with * (bold in the original article).

  Data           Algorithm          Parameter   RMSE
  Flixter (5)    UAF(R)             -           0.9534 ± 0.0012
                 UCF(R)             -           0.9498 ± 0.0011
                 FM(R)              σ = 0.20    0.9035 ± 0.0010
                 CKT-FM(R, G)       σ = 0.20    0.9027 ± 0.0010
                 CKT-FM(R, S)       σ = 0.15    0.8968 ± 0.0007
                 CKT-FM(R, G, S)    σ = 0.15    0.8937 ± 0.0008 *
                 FM(R, E)           σ = 0.20    0.8969 ± 0.0008
  Flixter (10)   UAF(R)             -           0.9379 ± 0.0010
                 UCF(R)             -           0.9242 ± 0.0008
                 FM(R)              σ = 0.15    0.8753 ± 0.0010
                 CKT-FM(R, G)       σ = 0.15    0.8747 ± 0.0008
                 CKT-FM(R, S)       σ = 0.15    0.8711 ± 0.0009
                 CKT-FM(R, G, S)    σ = 0.15    0.8687 ± 0.0008 *
                 FM(R, E)           σ = 0.15    0.8705 ± 0.0008
  Flixter (15)   UAF(R)             -           0.9309 ± 0.0009
                 UCF(R)             -           0.9115 ± 0.0008
                 FM(R)              σ = 0.15    0.8598 ± 0.0008
                 CKT-FM(R, G)       σ = 0.15    0.8591 ± 0.0007
                 CKT-FM(R, S)       σ = 0.10    0.8568 ± 0.0008
                 CKT-FM(R, G, S)    σ = 0.10    0.8549 ± 0.0008 *
                 FM(R, E)           σ = 0.15    0.8561 ± 0.0007

- Factorization based methods are much better than the smoothing method (i.e., UAF) and the memory-based method (i.e., UCF), which shows that factorization machine is indeed a very strong baseline; this is also consistent with various previous works.

- CKT-FM(R, G), CKT-FM(R, S) and CKT-FM(R, G, S) are all better than FM(R), UAF(R) and UCF(R), which clearly shows the usefulness of the shared compressed knowledge and the effectiveness of our knowledge transfer approach.

- CKT-FM(R, G, S) further improves the performance over CKT-FM(R, G) and CKT-FM(R, S), which shows that the compressed knowledge of user groups and that of item sets are complementary for the learning task on the target rating data.

- FM(R, E) performs well on both data sets w.r.t. prediction accuracy (i.e., the best on MovieLens10M and the second best on Flixter), as expected. However, the time and space cost of FM(R, E) is high, while our CKT-FM(R, G, S) strikes a good balance between efficiency and effectiveness. Furthermore, our CKT-FM(R, G, S) performs better than FM(R, E) on Flixter, which shows that the noise reduction step is helpful and that the compressed knowledge is more useful than the raw implicit feedbacks.

- The prediction performance shows that the benefit from knowledge transfer is larger on MovieLens10M than on Flixter, which is consistent with the relative ratios of auxiliary examination records to target explicit ratings, i.e., pe/p ∈ {11.2, 5.6, 3.7} for MovieLens10M and pe/p ∈ {4.4, 2.2, 1.5} for Flixter, as shown in Table 2. Also, the improvement in cases with fewer ratings (e.g., 5 ratings per user on average) is larger than in those with more ratings (e.g., 15), which shows that our CKT-FM is helpful for sparsity alleviation in the target explicit ratings.


Overall, the results in Tables 3-5 show that transferring compressed knowledge of user groups and item sets from auxiliary implicit feedbacks to target explicit feedbacks is helpful, and that the proposed knowledge transfer solution via factorization machine is both efficient and effective.

4.3.2. Results with different iteration numbers

The results of CKT-FM and FM with different iteration numbers are shown in Fig. 4, from which we can have the following observations:

- Both CKT-FM(R, G, S) and FM(R) converge smoothly within about 200 iterations in most cases, and the relative prediction performance of each approach is proportional to the number of ratings per user on average in both data sets.

- CKT-FM(R, G, S) is significantly better than FM(R) at all iteration numbers in almost all cases (except when the iteration number is smaller than 100 on Flixter with 15 ratings per user on average), which again shows the advantage of CKT-FM(R, G, S) with compressed knowledge transfer.

- The special case when the iteration number is smaller than 100 on Flixter with 15 ratings per user on average is due to the inconsistency of the compressed knowledge of user groups and item sets between the auxiliary implicit examinations and the target explicit ratings when the learning is not yet sufficient.

4.3.3. Results on different user segmentations

In order to gain a deeper understanding of the performance gain of our CKT-FM over the major baseline method FM, we analyze the performance of each method on different user segmentations. Specifically, we construct eight and twelve user segmentations w.r.t. different numbers of ratings in the test data, which are shown in the tables within Figs. 5 and 6. Note that due to the random generation of training and test data as described in Section 4.1, the distributions of user segmentations of training data and test data are similar. The results of CKT-FM and FM on different user segmentations are shown in Figs. 5 and 6, from which we can have the following observations:

- The results on active users (who have rated more items) are better than those on inactive users, and the overall performance in cases with more ratings per user on average (e.g., 15) is better than in those with fewer (e.g., 5 or 10), which is consistent with observations in other works [13,22].

- CKT-FM(R, G, S) is better than FM(R) on all user segmentations in all cases, which again clearly shows the advantage of our proposed knowledge transfer approach.

In summary, the results in Figs. 4-6 clearly show that our CKT-FM converges smoothly and performs significantly better than the method using explicit feedbacks only in all cases, including different user segmentations on data with different levels of sparsity.

Fig. 5. Prediction performance of CKT-FM and FM on the first copy of MovieLens10M for different user segmentations. The number of iterations is fixed as 200. We adopt the implementation of FM w/o block structure [24] in the experiments. Note that the number in parentheses denotes the number of ratings per user on average.

Fig. 6. Prediction performance of CKT-FM and FM on the first copy of Flixter for different user segmentations. The number of iterations is fixed as 200. We adopt the implementation of FM w/o block structure [24] in the experiments. Note that the number in parentheses denotes the number of ratings per user on average.

4.3.4. Improvement from compressed knowledge

In this section, we study two questions about the superior prediction performance of CKT-FM, in particular regarding the compressed knowledge. First, is the performance improvement of CKT-FM over FM simply from using more model parameters rather than from the mined knowledge? Note that the numbers of model parameters of CKT-FM and FM are (n + m + g + s) × f + f + 1 and (n + m) × f + f + 1, respectively, where f is the number of latent dimensions of FM. Second, is the denoising step via singular value decomposition helpful for mining high quality compressed knowledge?

In order to answer the first question, we conduct additional experiments with comparable numbers of model parameters. Specifically, we fix f = 20 in CKT-FM as before and use f = 21 in FM, so that the total number of model parameters in CKT-FM is now slightly smaller than that in FM. The prediction performance is shown in Table 6.

Table 6. Prediction performance of CKT-FM and FM with comparable numbers of model parameters on MovieLens10M and Flixter. The number of iterations is fixed as 200. The number of ratings per user on average is 10. We adopt the implementation of FM w/o block structure [24] in the experiments. The best results are marked with * (bold in the original article).

  Data           Algorithm          Latent dimension   Parameter   RMSE
  MovieLens10M   FM(R)              f = 20             σ = 0.20    0.8707 ± 0.0008
                 FM(R)              f = 21             σ = 0.20    0.8706 ± 0.0004
                 CKT-FM(R, G, S)    f = 20             σ = 0.15    0.8618 ± 0.0008 *
  Flixter        FM(R)              f = 20             σ = 0.15    0.8753 ± 0.0010
                 FM(R)              f = 21             σ = 0.15    0.8753 ± 0.0007
                 CKT-FM(R, G, S)    f = 20             σ = 0.15    0.8687 ± 0.0008 *

From Table 6, we can clearly see that simply using more model parameters does not help much. Hence, the answer to the first question is no, i.e., the performance improvement is not from using more model parameters, but from the compressed knowledge of user groups and item sets.

For the second question on the effect of denoising, we conduct comparative studies between CKT-FM with and without singular value decomposition for noise reduction. Specifically, we remove the denoising step (step 1.1 of the algorithm in Fig. 3) and revise the clustering step (step 1.2) to cluster on E and Eᵀ (instead of on U′B^{1/2} and V′B^{1/2}) to obtain the user groups and item sets, respectively. The prediction performance is reported in Table 7.

Table 7. Prediction performance of CKT-FM with and without the denoising step on MovieLens10M and Flixter. The number of iterations is fixed as 200. The number of ratings per user on average is 10. We adopt the implementation of FM w/o block structure [24] in the experiments. The best results are marked with * (bold in the original article).

  Data           Algorithm          Denoise   Parameter   RMSE
  MovieLens10M   CKT-FM(R, G, S)    No        σ = 0.20    0.8642 ± 0.0007
                 CKT-FM(R, G, S)    Yes       σ = 0.15    0.8618 ± 0.0008 *
  Flixter        CKT-FM(R, G, S)    No        σ = 0.15    0.8697 ± 0.0008
                 CKT-FM(R, G, S)    Yes       σ = 0.15    0.8687 ± 0.0008 *

From Table 7, we can see that removing the denoising step from CKT-FM hurts the performance. Hence, the noise reduction step is indeed helpful for mining high quality compressed knowledge, which gives a positive answer to the second question.
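As a quick arithmetic check of the parameter-count argument (using our literal reading of the formula above, with the MovieLens10M sizes from Section 4.1 and g = s = 200):

```python
n, m, g, s = 71_567, 10_681, 200, 200

def num_params(n_features, f):
    # Number of model parameters as counted in the text: (#features) * f + f + 1.
    return n_features * f + f + 1

print(num_params(n + m + g + s, f=20))  # CKT-FM(R, G, S): 1,652,981
print(num_params(n + m, f=21))          # FM(R) with f = 21: 1,727,230 (slightly more)
```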

5. Related work

Considering the studied problem setting and the proposed solution in this paper, we discuss some existing works on three closely related topics: collaborative recommendation, heterogeneous collaborative recommendation, and transfer learning in collaborative recommendation.

5.1. Collaborative recommendation

Collaborative recommendation techniques are usually categorized into memory-based methods, model-based methods and hybrid methods [2]. Memory-based methods include two similar variants, user-based and item-based approaches, where the user-based approach predicts a user's preference on an item via aggregating his or her neighbors' preferences on the item. Model-based methods, e.g., matrix factorization based algorithms [11,24,28], usually learn some latent user preferences and item features under the assumption that the observed ratings are generated by such latent variables. Hybrid methods include monolithic, parallelized and pipelined variants with different ways of hybridizing some basic recommendation techniques [10]. Model-based methods usually perform better in open competitions, as they are able to capture the hidden correlations among users and items [11]. Memory-based methods offer good interpretability and maintainability, and are thus also quite popular in real deployments [14]. However, most collaborative recommendation methods are designed for homogeneous feedbacks, such as numerical ratings, and very few works focus on heterogeneous feedbacks such as the implicit and explicit feedbacks shown in Fig. 1.

5.2. Heterogeneous collaborative recommendation

In a recent work, a collective matrix factorization (CMF) [29] based approach was proposed to exploit both explicit feedbacks and implicit feedbacks [15], which introduces a scaling or normalization process to mitigate the heterogeneity of the two types of feedbacks. However, using the same user-specific latent preference matrix U and item-specific latent feature matrix V in CMF [15,29] for both explicit ratings and implicit examinations may still not capture the preference difference well. The expectation-maximization collaborative filtering (EMCF) algorithm [30] estimates graded preferences for implicit feedbacks iteratively, which can then be added to the explicit feedbacks. However, such an iterative solution may not be efficient for large data, especially when there are many raw implicit feedbacks. The SVD++ model [11], or the equivalent constrained PMF model [28], is a principled approach for modeling explicit and implicit feedbacks simultaneously via extending the basic matrix factorization method on explicit ratings with interactions from implicit examinations, and it can be mimicked by factorization machine [24]. Hence, for a fair comparison of both accuracy and efficiency, we use factorization machine [24], i.e., FM(R, E), in our empirical studies. There are also some algorithms designed for specific applications, such as sequential radio channel recommendation via exploiting users' explicit and implicit feedbacks [17]. Note that the explicit feedbacks in [17] are different from the numerical ratings in our HCR, and the reinforcement learning algorithm there is also not applicable to our studied problem.

5.3. Transfer learning in collaborative recommendation

Transfer learning [19] aims to improve a learning task on some target data via transferring knowledge from some related learning tasks or some related auxiliary data. Transfer learning in collaborative recommendation is a new and active research area and has witnessed significant improvements of recommendation performance in several different recommendation scenarios, including recommendation without mappings between entities in two data sets [12], recommendation with two-side implicit feedbacks [22], recommendation with frontal-side binary explicit feedbacks [21,23], etc. In this paper, we study a new problem setting, i.e., recommendation with frontal-side implicit feedbacks, which is associated with few existing works.

From the perspective of "how to transfer" in transfer learning [19], existing transfer learning approaches include adaptive, collective and integrative algorithm styles. Typically, an integrative approach introduces richer interactions between the target data and the auxiliary data than adaptive and collective ones, and can usually achieve better recommendation performance. Our CKT-FM is such an integrative approach, since the mined compressed knowledge is incorporated into the target learning task as a whole. From the perspective of "what to transfer" in transfer learning [19], previous works share different types of knowledge, including covariance [1], codebooks [6,12,13,18], latent features [5,8,29], etc. Note that covariance, codebooks and latent features can also be considered as some type of compressed knowledge, since the raw auxiliary data are not preserved. In this paper, the mined user groups and item sets are a new type of compressed knowledge, which has not been explored before by works on transfer learning in collaborative recommendation.

In summary, we have designed a novel transfer learning solution for a new recommendation problem, i.e., an integrative transfer learning approach via factorization machine with compressed knowledge for HCR, as shown in Fig. 1.

6. Conclusions and future work

In this paper, we have proposed a novel and generic solution called compressed knowledge transfer via factorization machine (CKT-FM) for heterogeneous collaborative recommendation (HCR). Our solution contains two major steps: mining compressed knowledge of user groups and item sets, and integrating the compressed knowledge via design matrix expansion in factorization machine. Extensive experimental studies on two large data sets with different levels of sparsity show that our CKT-FM is significantly better than the state-of-the-art non-transfer learning method, and is much more efficient than the method leveraging raw implicit feedbacks.

For future work, we are mainly interested in designing a single unified optimization objective for compressed knowledge mining and integration in order to further improve the knowledge sharing process. We are also interested in generalizing CKT-FM from pointwise regression to pairwise or listwise ranking, aiming to optimize the top-k recommended items in a more direct manner [27,31], and in studying its performance on real industry data.

Acknowledgements

We thank the support of Natural Science Foundation of Guangdong Province No. 2014A030310268, Natural Science


Foundation of SZU No. 201436, National Natural Science Foundation of China (NSFC) Nos. 61170077 and 61272303, NSF GD No. 10351806001000000, GD S&T No. 2012B091100198, S&T projects of SZ Nos. JCYJ20130326110956468 and JCYJ20120613102030248, and National Basic Research Program of China (973 Plan) No. 2010CB327903. We are also thankful to the handling Editor and Reviewers for their constructive and expert comments.

References

[1] Ryan P. Adams, George E. Dahl, Iain Murray, Incorporating side information into probabilistic matrix factorization using Gaussian processes, in: Proceedings of the 26th Conference on Uncertainty in Artificial Intelligence, UAI'10, 2010, pp. 1-9.
[2] Gediminas Adomavicius, Alexander Tuzhilin, Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions, IEEE Trans. Knowl. Data Eng. 17 (2005) 734-749.
[3] Immanuel Bayer, Steffen Rendle, Factor models for recommending given names, in: ECML PKDD Discovery Challenge Workshop, 2013.
[4] Chih-Chung Chang, Chih-Jen Lin, LIBSVM: a library for support vector machines, ACM Trans. Intell. Syst. Technol. (ACM TIST) 2 (3) (2011) 27:1-27:27.
[5] Sotirios Chatzis, Nonparametric Bayesian multitask collaborative filtering, in: Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, CIKM'13, 2013, pp. 2149-2158.
[6] Sheng Gao, Hao Luo, Da Chen, Shantao Li, Patrick Gallinari, Jun Guo, Cross-domain recommendation via cluster-level latent factor model, in: Proceedings of the 2013 European Conference on Machine Learning and Knowledge Discovery in Databases - Part II, 2013, pp. 161-176.
[7] Liang Hu, Jian Cao, Guandong Xu, Longbing Cao, Zhiping Gu, Can Zhu, Personalized recommendation via cross-domain triadic factorization, in: Proceedings of the 22nd International Conference on World Wide Web, WWW'13, 2013, pp. 595-606.
[8] Liang Hu, Jian Cao, Guandong Xu, Jie Wang, Zhiping Gu, Longbing Cao, Cross-domain collaborative filtering via bilinear multilevel analysis, in: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, IJCAI'13, 2013, pp. 2626-2632.
[9] Mohsen Jamali, Martin Ester, A matrix factorization technique with trust propagation for recommendation in social networks, in: Proceedings of the 4th ACM Conference on Recommender Systems, RecSys'10, 2010, pp. 135-142.
[10] Dietmar Jannach, Markus Zanker, Alexander Felfernig, Gerhard Friedrich, Recommender Systems: An Introduction, first ed., Cambridge University Press, New York, NY, USA, 2010.
[11] Yehuda Koren, Factorization meets the neighborhood: a multifaceted collaborative filtering model, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'08, 2008, pp. 426-434.
[12] Bin Li, Qiang Yang, Xiangyang Xue, Can movies and books collaborate? Cross-domain collaborative filtering for sparsity reduction, in: Proceedings of the 21st International Joint Conference on Artificial Intelligence, IJCAI'09, 2009, pp. 2052-2057.
[13] Bin Li, Qiang Yang, Xiangyang Xue, Transfer learning for collaborative filtering via a rating-matrix generative model, in: Proceedings of the 26th Annual International Conference on Machine Learning, ICML'09, 2009, pp. 617-624.
[14] Greg Linden, Brent Smith, Jeremy York, Amazon.com recommendations: item-to-item collaborative filtering, IEEE Internet Comput. 7 (1) (2003) 76-80.
[15] Nathan N. Liu, Evan W. Xiang, Min Zhao, Qiang Yang, Unifying explicit and implicit feedback for collaborative filtering, in: Proceedings of the 19th ACM International Conference on Information and Knowledge Management, CIKM'10, 2010, pp. 1445-1448.
[16] Babak Loni, Yue Shi, Martha A. Larson, Alan Hanjalic, Cross-domain collaborative filtering with factorization machines, in: Proceedings of the 36th European Conference on Information Retrieval, ECIR'14, April 2014.
[17] Omar Moling, Linas Baltrunas, Francesco Ricci, Optimal radio channel recommendations with explicit and implicit feedback, in: Proceedings of the 6th ACM Conference on Recommender Systems, RecSys'12, 2012, pp. 75-82.
[18] Orly Moreno, Bracha Shapira, Lior Rokach, Guy Shani, TALMUD: transfer learning for multiple domains, in: Proceedings of the 21st ACM International Conference on Information and Knowledge Management, CIKM'12, 2012, pp. 425-434.
[19] Sinno Jialin Pan, Qiang Yang, A survey on transfer learning, IEEE Trans. Knowl. Data Eng. 22 (10) (2010) 1345-1359.
[20] Weike Pan, Li Chen, GBPR: group preference based Bayesian personalized ranking for one-class collaborative filtering, in: Proceedings of the 23rd International Joint Conference on Artificial Intelligence, IJCAI'13, 2013, pp. 2691-2697.
[21] Weike Pan, Nathan N. Liu, Evan W. Xiang, Qiang Yang, Transfer learning to predict missing ratings via heterogeneous user feedbacks, in: Proceedings of the 22nd International Joint Conference on Artificial Intelligence, July 2011, pp. 2318-2323.
[22] Weike Pan, Evan W. Xiang, Nathan N. Liu, Qiang Yang, Transfer learning in collaborative filtering for sparsity reduction, in: Proceedings of the 24th AAAI Conference on Artificial Intelligence, AAAI'10, 2010, pp. 230-235.
[23] Weike Pan, Qiang Yang, Transfer learning in heterogeneous collaborative filtering domains, Artif. Intell. 197 (2013) 39-55.
[24] Steffen Rendle, Factorization machines with libFM, ACM Trans. Intell. Syst. Technol. (ACM TIST) 3 (3) (2012) 57:1-57:22.
[25] Steffen Rendle, Social network and click-through prediction with factorization machines, in: KDD-Cup Workshop, 2012.
[26] Steffen Rendle, Scaling factorization machines to relational data, in: Proceedings of the 39th International Conference on Very Large Data Bases, PVLDB'13, VLDB Endowment, 2013, pp. 337-348.
[27] Steffen Rendle, Christoph Freudenthaler, Zeno Gantner, Lars Schmidt-Thieme, BPR: Bayesian personalized ranking from implicit feedback, in: Proceedings of the 25th Conference on Uncertainty in Artificial Intelligence, UAI'09, 2009, pp. 452-461.
[28] Ruslan Salakhutdinov, Andriy Mnih, Probabilistic matrix factorization, in: Annual Conference on Neural Information Processing Systems, vol. 20, MIT Press, 2008, pp. 1257-1264.
[29] Ajit P. Singh, Geoffrey J. Gordon, Relational learning via collective matrix factorization, in: Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, KDD'08, 2008, pp. 650-658.
[30] Bin Wang, Mohammadreza Rahimi, Dequan Zhou, Xin Wang, Expectation-maximization collaborative filtering with explicit and implicit feedback, in: Proceedings of the 16th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining - Volume Part I, PAKDD'12, 2012, pp. 604-616.
[31] Markus Weimer, Alexandros Karatzoglou, Alex Smola, Improving maximum margin matrix factorization, in: Proceedings of the 2008 European Conference on Machine Learning and Knowledge Discovery in Databases - Part I, ECML PKDD'08, 2008.
[32] Tom Chao Zhou, Hao Ma, Irwin King, Michael R. Lyu, TagRec: leveraging tagging wisdom for recommendation, in: Proceedings of the 2009 International Conference on Computational Science and Engineering, vol. 04, 2009, pp. 194-199.
