Knowl Inf Syst DOI 10.1007/s10115-007-0118-y REGULAR PAPER

Ranking with decision tree Fen Xia · Wensheng Zhang · Fuxin Li · Yanwu Yang

Received: 28 December 2006 / Revised: 8 October 2007 / Accepted: 24 November 2007 © Springer-Verlag London Limited 2007

Abstract Ranking problems have recently become an important research topic in the joint field of machine learning and information retrieval. This paper presents a new splitting rule that introduces a metric, i.e., an impurity measure, for constructing decision trees for ranking tasks. We provide a theoretical basis and some intuitive explanations for the splitting rule. Our approach is also meaningful to collaborative filtering in the sense of dealing with categorical data and selecting relevant features. Experiments illustrate our ranking approach, and the results show that our algorithm outperforms both perceptron-based ranking and classification tree algorithms in terms of accuracy as well as speed.

Keywords

Machine learning · Ranking · Decision tree · Splitting rule

F. Xia (B) · W. Zhang · F. Li · Y. Yang
The Key Laboratory of Complex Systems and Intelligence Science, Institute of Automation, Chinese Academy of Sciences, Beijing, People's Republic of China
e-mail: [email protected]

1 Introduction

Ranking problems are among the most important components of many applications, including information retrieval and filtering [8]. In information retrieval, documents are ranked before being delivered to users. In order to elicit and model user preferences, more precise relevance measures are required than simple binary values (relevant/non-relevant). Some researchers argue that users' relevance judgments can be reflected by a continuum of relevance regions, from highly relevant through partially relevant to non-relevant [17]. The number of ratings differs according to users' requirements and the precision required by the application. A collaborative filtering application such as movie recommendation aims to generate and deliver recommendations of movies that users would be likely to enjoy. Initially, users are asked to rate a list of movies that they have seen. Possible ratings might be run-to-see, very-good, good, only-if-you-must, and do-not-bother [5]. Consequently, these ratings are classified according to some similarity measure. Finally, recommendations are given based on users' rating similarities.

In a broad sense, ranking aims to predict instances on an ordinal scale, i.e., the so-called ordinal regression [10]. Ranking can be regarded as a supervised inductive learning task, which predicts values of an ordinal function for any valid input object according to rules learned from a number of training examples (i.e., pairs of inputs and target outputs). Traditionally, supervised learning research has mainly focused on classification [12,21] and regression [16] problems, and has made significant progress on learning algorithms and evaluation approaches [9]. However, ranking requires higher-level functions, e.g., ordering things rather than simply classifying them. As a matter of fact, ranking lies between classification and regression. In classification, given a training set with associated class labels, the goal is to learn rules that assign a new instance a correct label. In regression, the outputs of the training set are real-valued. Ranking is similar to classification in that the outputs of both form a finite set, and to regression in that there is an ordinal relationship among the elements of the finite label set. Various machine learning algorithms have been adapted to ranking tasks to automate the ranking process [4,6,7] and to improve ranking accuracy and speed [2,20]. A natural idea is to convert ranking problems into classification or regression problems. In the classification setting, a new training set is formed by extracting pairs of examples with different ratings; ranking rules are then constructed by binary classifiers on this training set [10]. However, this is time-consuming, as the data complexity increases from O(n) to O(n²). In the regression setting, a proper mapping is used to convert rankings to real values.
However, it is hard to determine a proper mapping, and the corresponding algorithms might be sensitive to the representation of the rankings rather than to their pairwise orderings [10,11]. State-of-the-art ranking approaches mainly assume that rankings are coarsely measured latent continuous variables, and model them with intervals on the real line. Based on this assumption, these algorithms seek a direction representing the real line, onto which examples are projected, and a set of thresholds dividing the direction into consecutive intervals. [5] proposed a perceptron-based algorithm (called Prank) that seeks the direction and the thresholds to construct ranking rules. This is an online, mistake-driven procedure initialized with a direction and a set of thresholds that are adapted each time a training example is misclassified. The procedure is guaranteed to converge if there exist a direction and a set of corresponding thresholds that can correctly rank the training data. [11] generalized Prank by running several Prank algorithms in parallel; the outputs were averaged to produce a ranking rule, which showed improved performance over a single Prank algorithm. A practical application of perceptron-based algorithms was proposed in [18], which dealt with ranking and re-ranking problems in natural language processing. The algorithm searches for pairs of mis-ranked objects and uses them to update the weight vectors. Perceptron-based ranking algorithms avoid the increase in data complexity, and outperform the pair-based algorithms (described in [10]) [11]. However, perceptron-based methods also have certain shortcomings, e.g., they can only deal with real-valued inputs. When the data are linearly separable, there are many solutions, and which one is found depends on the starting values. When the data are not linearly separable, the algorithms will not converge, because they inevitably produce cycles; the cycles can be long and therefore hard to detect [10].
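The Prank procedure described above can be sketched as follows. This is a minimal reconstruction from the description in the text and in [5], not the original implementation; the function names, the epoch loop, and the stopping policy are ours.

```python
import numpy as np

def prank_fit(X, y, k, epochs=10):
    """Minimal Prank sketch: learn a direction w and thresholds b that
    divide the real line into k consecutive intervals (labels 1..k)."""
    n, d = X.shape
    w = np.zeros(d)
    b = np.zeros(k - 1)              # thresholds b_1 <= ... <= b_{k-1}
    for _ in range(epochs):
        for x, label in zip(X, y):
            score = w @ x
            # predicted rank: 1 plus the number of thresholds below the score
            pred = 1 + np.sum(score >= b)
            if pred != label:        # mistake-driven update
                # y_r = +1 if the true label lies above threshold r, else -1
                yr = np.where(np.arange(1, k) < label, 1.0, -1.0)
                # update only the thresholds on the wrong side of the score
                tau = np.where((score - b) * yr <= 0, yr, 0.0)
                w += tau.sum() * x
                b -= tau
    return w, b

def prank_predict(w, b, X):
    return 1 + np.sum(X @ w[:, None] >= b, axis=1)
```

On linearly separable ranking data this update converges, as guaranteed in [5]; on non-separable data it cycles, which is exactly the shortcoming discussed above.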
A kernel mapping is needed for non-linearly separable data, and the accuracy of the resulting ranking rules is sensitive to the choice of that mapping. Decision trees can, to some degree, overcome these shortcomings of perceptron-based methods. An intuitive way to extend decision trees to ranking problems is to formulate them as multi-class classification problems. However, this is not straightforward, because the splitting rule in classification trees does not take ordinal relationships into account. As a result, misclassifications cannot be discriminated by the degree of their deviation from the true ratings. Intuitively, the discrimination can be measured by the impurity of a set. A pair of irrelevant items should cause more impurity than a relevant pair. Therefore, the more pairs with different ratings a set contains, the more impure it is. In this paper, we formalize this intuition with a new impurity measure, which introduces a metric on a set. The new impurity measure takes ordinal relationships into account, e.g., it makes the corresponding splitting rule prefer child nodes with closer ranking distance, which is proved theoretically in this paper. Based on the new impurity measure, a decision tree, which we call the Ranking Tree (RT), is trained on ranking data sets to assign new examples ranking labels.

The remainder of this paper is organized as follows: Sect. 2 gives a brief description of decision trees. In Sect. 3, we describe two impurity metrics, namely the gini impurity and our ranking impurity. In Sect. 4, we provide a theoretical basis for the ranking impurity and analyze its capacity in ranking problems. In Sect. 5, we present experimental results and compare them with the results in [11]. In Sect. 6, we conclude the paper and outline possible future research issues.

2 Decision tree

Decision trees produce good predictions and easy-to-interpret rules. They can accept continuous, discrete and categorical inputs, handle missing values, and select relevant features to produce simple rules. They are invariant under strictly monotone transformations of individual inputs. Decision tree learning is a method for approximating discrete-valued target functions. Learned trees can also be represented as sets of if-then rules to improve human readability. Each node in the tree specifies a test of an instance on selected attributes, and each branch descending from that node corresponds to one of the possible values of the attribute. An instance is classified by starting at the root node of the tree, testing the attribute specified by that node, and then moving down the selected branch to a new node. This process is iterated until one or several leaf nodes are reached [9]. Finally, the classifications associated with those leaf nodes are returned and possibly combined to assign a label to the instance [13].

Decision tree algorithms contain two important parts, namely the splitting criterion and the pruning criterion. The splitting criterion is used to grow the tree: in each splitting, the algorithm checks every possible splitting point and chooses the point where a certain impurity measure decreases most. When the growth process is finished, the pruning criterion is used to prune the tree in order to increase its generalization ability. The impurity measure differs among decision tree algorithms. In classification, CART [3] takes the gini criterion, while ID3 [14] and C4.5 [15] prefer the entropy criterion. Experimentally, the gini criterion and the entropy criterion are statistically indistinguishable [1]. In this paper, we propose a new impurity measure for the ranking process in decision trees, which can be readily plugged into nearly all decision tree algorithms with little adaptation.
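The greedy splitting step just described can be sketched as follows. This is an illustrative sketch, not CART or rpart itself; the function names are ours, the impurity measure is passed in as a function (so the gini or the ranking impurity of Sect. 3 can be plugged in), and, for simplicity, the children are not weighted by their sizes here (CART's gini rule weights each child's impurity by the fraction of examples it receives).

```python
def best_split(xs, labels, impurity):
    """Greedy splitting step on a single feature: try every threshold and
    return (threshold, decrease) for the largest impurity decrease."""
    pairs = sorted(zip(xs, labels))
    best = (None, 0.0)
    total = impurity([l for _, l in pairs])
    for cut in range(1, len(pairs)):
        if pairs[cut - 1][0] == pairs[cut][0]:
            continue                 # no threshold fits between equal values
        left = [l for _, l in pairs[:cut]]
        right = [l for _, l in pairs[cut:]]
        decrease = total - impurity(left) - impurity(right)
        if decrease > best[1]:
            threshold = (pairs[cut - 1][0] + pairs[cut][0]) / 2
            best = (threshold, decrease)
    return best
```

A full tree grower would apply this search to every feature at every node, recurse on the two children, and then prune; only the split-selection step is shown here.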

3 The splitting rules

One of the most commonly used impurity measures in classification problems is the gini impurity, defined as follows.


3.1 The gini impurity

Definition 1 Given an example set T, let pi = p(i|T) be the proportion of examples belonging to class i, where i ∈ {1, . . . , k} is the class label. The gini impurity (also known as the gini index) is defined as

    Igini(T) = Σ_{i=1}^{k} pi(1 − pi)    (1)

There are two interesting interpretations of the gini impurity. If an example belongs to class i with probability pi, the probability of misclassifying it is pi(1 − pi); thus the expected loss over all classes is Σ_{i=1}^{k} pi(1 − pi). Similarly, if each example is coded as 1 for class i (0 otherwise), so that the code is 1 with probability pi, the variance of this code variable is pi(1 − pi). Summing over the k classes gives the gini impurity again, i.e., Σ_{i=1}^{k} pi(1 − pi).

A splitting operation divides a set T into two sets TL and TR, namely the left child and the right child of T, respectively. The splitting rule of the gini impurity is to find the best splitting, i.e., the one that maximizes the expected decrease of the gini impurity, defined as

    ΔI = Igini(T) − Igini(TL)p(TL) − Igini(TR)p(TR)    (2)

where p(TL) and p(TR) denote the proportions of examples going to the left child and the right child, respectively. To prevent the creation of a degenerate tree, the decrease ΔI must be non-negative. It is easy to prove that the gini impurity is strictly concave [3].

The gini index is well suited to classification tasks. In ranking tasks, however, it ignores ordinal relationships among class labels. Consider the first interpretation of the gini impurity: misclassifying an example from class i into any other class produces an equal portion of loss, whereas in ranking problems mis-ranking an item further away from its actual ranking means more loss. To deal with such an unbalanced loss in ranking situations, we introduce another impurity measure for splitting rules to build a ranking tree.

3.2 The ranking impurity

We now present our new impurity measure, named the ranking impurity.

Definition 2 Given an example set T labeled by a totally ordered label set L = {L1, . . . , Lk}, let Ni(T) be the number of elements in T that have label Li. The ranking impurity is given by

    Irank(T) = Σ_{j=1}^{k} Σ_{i=1}^{j} (j − i)Nj(T)Ni(T).    (3)

The ranking impurity can be interpreted as follows. Suppose that a1 ∈ T has label L1, a2 ∈ T has label L2, and L1 < L2. These two examples form a mis-ranked pair if a1 is ranked after a2. The mis-ranked pair can be weighted by the difference between their rankings, that is, L2 − L1. Since every example can potentially form a weighted mis-ranked pair with each example ranked before it, summing these weights over all pairs gives the maximum potential number of weighted mis-ranked pairs in the set, which is exactly the ranking impurity.
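Definition 2 and its pair interpretation can be sketched as follows; this is an illustrative sketch with our own function names, assuming the labels L1, . . . , Lk are represented as the integers 1..k. The second function is the brute-force pair view: it confirms that (3) equals the rank-distance-weighted count of all pairs of examples with different labels.

```python
from collections import Counter

def ranking_impurity(labels):
    """Ranking impurity (3): sum over label pairs i < j of (j - i) * Nj * Ni."""
    counts = Counter(labels)
    return sum((j - i) * counts[j] * counts[i]
               for j in counts for i in counts if i < j)

def pairwise_weight(labels):
    """Equivalent pair view: every pair of examples with different labels is a
    potential mis-ranked pair, weighted by the distance between its labels."""
    return sum(abs(a - b)
               for idx, a in enumerate(labels) for b in labels[idx + 1:])
```

For example, the set {1, 1, 2, 3, 3, 3} contains 2 pairs of weight 1 across labels (1, 2), 6 pairs of weight 2 across (1, 3) and 3 pairs of weight 1 across (2, 3), so both functions return 17.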


The splitting rule of our ranking impurity searches for the best splitting point, i.e., the one that maximizes the decrease of the ranking impurity. It is defined as

    ΔI = Irank(T) − Irank(TL) − Irank(TR).    (4)

The objective of the splitting rule can be interpreted as minimizing the maximum potential number of weighted mis-ranked pairs in both TL and TR. The ΔI in (4) is positive iff neither TL nor TR is empty, as shown in the following proposition:

Proposition 1 ΔI in (4) is non-negative. It is positive iff neither TL nor TR is empty.

Proof Note that Ni(T) = Ni(TL) + Ni(TR). Then

    Irank(T) = Σ_{j=1}^{k} Σ_{i=1}^{j} (j − i)Nj(T)Ni(T)
             = Σ_{j=1}^{k} Σ_{i=1}^{j} (j − i)(Nj(TL) + Nj(TR))(Ni(TL) + Ni(TR))
             ≥ Σ_{j=1}^{k} Σ_{i=1}^{j} (j − i)(Nj(TL)Ni(TL) + Nj(TR)Ni(TR))
             = Irank(TL) + Irank(TR)

Thus, ΔI is non-negative. Equality holds only when all the cross terms (j − i)Nj(TL)Ni(TR) and (j − i)Nj(TR)Ni(TL) vanish. Therefore, if neither TL nor TR is empty, ΔI is positive. □
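The non-negativity part of Proposition 1 can be spot-checked numerically. The sketch below (ours, not part of the paper) draws random label sets and random partitions and verifies that the decrease (4) is never negative; it also checks one strictly positive case.

```python
import random
from collections import Counter

def ranking_impurity(labels):
    """Ranking impurity (3) with integer labels 1..k."""
    counts = Counter(labels)
    return sum((j - i) * counts[j] * counts[i]
               for j in counts for i in counts if i < j)

def check_proposition(trials=200, seed=0):
    """Spot-check: Irank(T) >= Irank(TL) + Irank(TR) for random splits."""
    rng = random.Random(seed)
    for _ in range(trials):
        T = [rng.randint(1, 5) for _ in range(rng.randint(2, 30))]
        cut = rng.randint(0, len(T))
        TL, TR = T[:cut], T[cut:]
        delta = ranking_impurity(T) - ranking_impurity(TL) - ranking_impurity(TR)
        assert delta >= 0          # the decrease (4) is never negative
    return True
```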

4 Theoretical proof

In this section we give a theoretical basis for the proposed ranking impurity (3) and analyze its capacity in ranking problems. First, some theorems are given to support the ranking impurity. Then we compare it with the gini impurity and point out the deficiency of the gini impurity in ranking settings.

Lemma 1 The function F(x1, x2, . . . , xk) = Σ_{i=1}^{k} Σ_{j=i+1}^{k} aij xi xj + Σ_{i=1}^{k} bi xi, subject to 0 ≤ xi ≤ ω, with aij ≠ 0 for every pair i, j ∈ {1, . . . , k}, achieves its extreme values on the boundary.

Proof F is a continuous quadratic function on a closed, bounded region, so it attains its extrema at stationary points or on the boundary. It suffices to show that an extremum cannot be achieved at any stationary point. Taking the partial derivative of F with respect to each coordinate xj, j ∈ {1, . . . , k}, a stationary point is a solution of the following equation:

    [ 0    a12  ...  a1k ] [ x1 ]   [ −b1 ]
    [ a21  0    ...  a2k ] [ x2 ] = [ −b2 ]    (5)
    [ ...           ...  ] [ .. ]   [ ...  ]
    [ ak1  ak2  ...  0   ] [ xk ]   [ −bk ]


where aij = aji. Let Ak denote the k × k coefficient matrix in (5), i.e., the symmetric matrix with zero diagonal and off-diagonal entries aij. To decide whether a stationary point is an extreme point we examine the Hessian of F(x1, x2, . . . , xk), which is exactly Ak. Since tr(Ak) = 0 and Ak ≠ 0, Ak must have both positive and negative eigenvalues, so it is indefinite and no stationary point can be an extremum. □

Theorem 1 Given an example set T labeled by a totally ordered label set L = {L1, . . . , Lk}, let Ni(T) be the number of elements that have label Li in T, s a split, and TL and TR the left and right nodes after the split. Suppose that N1(T) = N2(T) = · · · = Nk(T) = ω and the examples can arbitrarily go to either the left node or the right node. Then the best splitting s of

    ΔI(T, s) = Irank(T) − Irank(TL) − Irank(TR)    (6)

is achieved at [we assume N1(TL) ≠ 0]:

k even: N1(TL) = N2(TL) = · · · = Nk/2(TL) = ω, Nk/2+1(TL) = Nk/2+2(TL) = · · · = Nk(TL) = 0 and N1(TR) = N2(TR) = · · · = Nk/2(TR) = 0, Nk/2+1(TR) = Nk/2+2(TR) = · · · = Nk(TR) = ω;

k odd: N1(TL) = N2(TL) = · · · = N(k±1)/2(TL) = ω, N(k±1)/2+1(TL) = N(k±1)/2+2(TL) = · · · = Nk(TL) = 0 and N1(TR) = N2(TR) = · · · = N(k±1)/2(TR) = 0, N(k±1)/2+1(TR) = N(k±1)/2+2(TR) = · · · = Nk(TR) = ω.

Proof According to (3), it follows that

    Irank(T) = Σ_{j=1}^{k} Σ_{i=1}^{j} (j − i)Nj(T)Ni(T)
    Irank(TL) = Σ_{j=1}^{k} Σ_{i=1}^{j} (j − i)Nj(TL)Ni(TL)
    Irank(TR) = Σ_{j=1}^{k} Σ_{i=1}^{j} (j − i)Nj(TR)Ni(TR).

Note that N1(T) = N2(T) = · · · = Nk(T) = ω, and for every i ∈ {1, . . . , k}, Ni(TL) + Ni(TR) = ω. So, writing Ni(TL) = xi, we have Ni(TR) = ω − xi with 0 ≤ xi ≤ ω. Reformulate (6) as

    ΔI(T, s) = −Σ_{j=1}^{k} Σ_{i=1}^{j} (j − i)(2xi xj − ωxi − ωxj),  0 ≤ xi ≤ ω.    (7)

According to Lemma 1, (7) achieves its maximum on the boundary. Without loss of generality, impose a boundary condition xj = 0. Replacing xj with 0 in (7), the new function still satisfies the conditions of Lemma 1. This procedure can be repeated until the vertices of the boundary are reached, so the maximum is achieved at one of the vertices. Comparison among all vertices gives the result. □

Theorem 1 shows that when the numbers of examples with different labels are the same at a node T, examples with closer rankings tend to go together in a splitting. We now come to the case where the numbers of examples with different labels differ at a node T. Let us consider the case of k = 3. (The analysis of splittings for k > 3 can be done in a similar manner.)

Theorem 2 Given an example set T labeled by the label set {1, 2, 3}, let Ni(T) = ωi be the number of elements that have label i in T, s a split, and TL and TR the left and right nodes after the split. Suppose the examples can arbitrarily go to either the left node or the right node. Then the best split is achieved by the following rules [as in Theorem 1, we assume N1(TL) ≠ 0]:

If ω2 ≤ 2ω1 and ω3 ≤ ω1, then N1(TL) = ω1, N2(TL) = N3(TL) = 0, N1(TR) = 0, N2(TR) = ω2, N3(TR) = ω3, as illustrated in Fig. 1a.
If ω2 ≤ 2ω3 and ω1 ≤ ω3, then N1(TL) = ω1, N2(TL) = ω2, N3(TL) = 0, N1(TR) = N2(TR) = 0, N3(TR) = ω3, as illustrated in Fig. 1b.
Otherwise, N1(TL) = ω1, N2(TL) = 0, N3(TL) = ω3, N1(TR) = 0, N2(TR) = ω2, N3(TR) = 0, as illustrated in Fig. 1c.

Proof Reformulate (6) for the case of k = 3. As in the proof of Theorem 1, Lemma 1 proves Theorem 2. □

Theorem 2 shows that in the case of k = 3, in order to avoid separating out the examples with label 2, their number should be less than either twice the number with label 1 or twice the number with label 3. Now, let us consider the case of k = 3 for the gini impurity.
Fig. 1 Three splittings of the ranking tree (k = 3)

When the numbers of examples with different labels are the same, e.g., N1(T) = N2(T) = N3(T) = ω, the gini impurity cannot distinguish the splitting [N1(TL) = ω, N2(TL) = N3(TL) = 0 and N1(TR) = 0, N2(TR) = N3(TR) = ω] from the splitting [N1(TL) = ω, N2(TL) = 0, N3(TL) = ω and N1(TR) = 0, N2(TR) = ω, N3(TR) = 0]. In the ranking setting the former should be more "pure" than the latter. Roughly speaking, the ranking impurity emphasizes the role of individual examples while the gini impurity emphasizes the role of individual classes. Moreover, the former takes ordinal relationships into account: the ranking impurity groups examples with closer ratings together in each splitting step. Computationally, the two impurity measures share the same goal of making leaf nodes pure. However, the splitting processes can be very different because of the greedy nature of tree-based algorithms. The ranking impurity is more suitable for the ranking process in decision trees.
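The example just discussed can be computed directly. The sketch below (function names ours) evaluates both impurity decreases, with the gini decrease child-weighted as in (2) and the ranking decrease unweighted as in (4), on the two k = 3 splittings with ω examples per label: the gini decreases tie, while the ranking impurity prefers the splitting that keeps the closer ranks 2 and 3 together.

```python
from collections import Counter

def gini(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

def ranking_impurity(labels):
    counts = Counter(labels)
    return sum((j - i) * counts[j] * counts[i]
               for j in counts for i in counts if i < j)

def gini_decrease(T, TL, TR):        # eq. (2): children weighted by size
    return gini(T) - gini(TL) * len(TL) / len(T) - gini(TR) * len(TR) / len(T)

def rank_decrease(T, TL, TR):        # eq. (4): unweighted
    return ranking_impurity(T) - ranking_impurity(TL) - ranking_impurity(TR)

w = 10
T = [1] * w + [2] * w + [3] * w
split_a = ([1] * w, [2] * w + [3] * w)   # {1} vs {2, 3}: close ranks together
split_b = ([1] * w + [3] * w, [2] * w)   # {1, 3} vs {2}: close ranks separated
# gini_decrease gives the same value for both splittings,
# while rank_decrease is strictly larger for split_a.
```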

5 Experiments and discussion

We performed experiments to compare the Ranking Tree with the perceptron-based ranking algorithms and the Classification Tree on the data sets used in [11], including a synthetic data set and several real-world data sets. In our experiments the Classification and Regression Tree (CART) was used as the fundamental decision tree algorithm. The implementation of CART is based on the rpart package in R (refer to http://www.r-project.org).

5.1 Ranking tree with the synthetic data set

We generated a synthetic data set using the same data generation process as given in [5,10,11]. First, random points (x1, x2) were generated according to the uniform distribution on the unit square [0, 1] × [0, 1]. Second, each point was assigned a ranking y chosen from the set {1, 2, . . . , 5} using the ranking rule y = max_{r∈{1,2,...,5}} {r : 10((x1 − 0.5)(x2 − 0.5)) + ε > br}, where b = {−∞, −1, −0.1, 0.25, 1} and ε is normally distributed with zero mean and standard deviation 0.125. We took the average rank loss (1/T) Σ_{t=1}^{T} |ŷt − yt|, where T is the number of examples in the test set, to measure the performance of the algorithms. It quantifies the accuracy of the predicted rankings ŷt with reference to the true rankings yt.

As in [11], we ran 20 Monte-Carlo trials with 50,000 training examples and a separate test set of 1,000 examples. The Ranking Tree and the Classification Tree were applied to these data sets. Cross-validation was used to choose the depth of the tree. Table 1 shows the experimental results of our RT algorithm and the Classification Tree compared with the results reported in [11]. The lowest value among the results is boldfaced, and the results are reported with their corresponding 95% confidence intervals under the Student's t distribution. Table 1 shows that the perceptron-based algorithms achieve worse performance than the algorithms based on decision trees. The main reason might be the difficulty of finding a good kernel mapping in this problem context. As can be seen from Table 1, the Classification Tree performs as well as the Ranking Tree on the synthetic data. However, the Ranking Tree shows a faster convergence rate with respect to the depth of the tree, as shown in Fig. 2. This coincides with our previous theoretical analysis of the capacity of the ranking impurity. That is, the Ranking Tree makes better partitions than the Classification Tree in terms of accuracy as well as speed. The fast convergence rate matters when the time and cost of obtaining feature values are significant.

The partitioning processes of the Classification Tree and the Ranking Tree are illustrated in Figs. 3 and 4, respectively, to investigate the splitting difference between them. The algorithms were applied to 1,000 synthetic examples. The symbols ('*', 'x', '+', 'o' and '.') denote examples of ranking 1, ranking 2, ranking 3, ranking 4 and ranking 5,


Table 1 The average rank loss produced by different algorithms on the synthetic data set

Algorithm                  Rank loss
RT with depth = 9          0.16 ± 0.01
CT with depth = 9          0.17 ± 0.01
OAP-VP with τ = 0.3        0.32 ± 0.01
OAP-VP with τ = 0.6        0.31 ± 0.02
OAP-VP with τ = 0.9        0.31 ± 0.03
OAP-Bagg with τ = 0.3      0.33 ± 0.01
OAP-Bagg with τ = 0.6      0.31 ± 0.02
OAP-Bagg with τ = 0.9      0.32 ± 0.03
OAP-BPM with τ = 0.3       0.23 ± 0.01
OAP-BPM with τ = 0.6       0.24 ± 0.03
OAP-BPM with τ = 0.9       0.26 ± 0.03
Prank                      0.37 ± 0.07
Prank-VP                   0.31 ± 0.00
WH with η = 0.1            0.30 ± 0.02

[RT ranking tree, CT classification tree, OAP-VP online aggregate Prank-voted perceptron, OAP-Bagg online aggregate Prank-bagging, OAP-BPM online aggregate Prank-Bayes point machine, Prank perceptron ranking algorithm, Prank-VP Prank with voted perceptron, WH the Widrow-Hoff algorithm, τ, η parameters]
Fig. 2 The average five-fold cross-validated rank loss of the classification tree and the ranking tree

respectively. From Fig. 3a, we can see that the first splitting of the Classification Tree is near the boundary of the region. This is clearly a consequence of the gini impurity, which favors small pure child nodes. However, splitting near the boundary surely increases the difficulty of fully classifying the entire data set. The Ranking Tree (Fig. 4a) in this case makes a much wiser initial splitting in the middle of the region, which facilitates further splittings. The subsequent splittings tell the same story. It can be seen clearly that, when the depth of the tree is small, the Ranking Tree gives a much more regular model than the Classification Tree. This implies better generalization of the Ranking Tree in the sense of Occam's Razor, which is validated by the results shown in Fig. 2.
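The data generation process and the rank-loss measure used above can be sketched as follows. This is our reconstruction from the description in the text (function names ours); the fourth threshold is taken as 0.25 so that the threshold vector is increasing, following [5].

```python
import math
import random

def make_synthetic(n, seed=0):
    """Synthetic ranking data of Sect. 5.1: uniform points on the unit square,
    rank y = max{r : 10(x1 - 0.5)(x2 - 0.5) + eps > b_r} with
    b = (-inf, -1, -0.1, 0.25, 1) and Gaussian noise (sigma = 0.125)."""
    rng = random.Random(seed)
    b = [-math.inf, -1.0, -0.1, 0.25, 1.0]
    data = []
    for _ in range(n):
        x1, x2 = rng.random(), rng.random()
        score = 10 * (x1 - 0.5) * (x2 - 0.5) + rng.gauss(0, 0.125)
        y = max(r for r in range(1, 6) if score > b[r - 1])
        data.append(((x1, x2), y))
    return data

def average_rank_loss(y_pred, y_true):
    """Average rank loss: (1/T) * sum of |y_hat_t - y_t| over the test set."""
    return sum(abs(p - t) for p, t in zip(y_pred, y_true)) / len(y_true)
```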


Fig. 3 Five partitions by the classification tree with the average rank loss as 1.227, 1.154, 0.665, 0.350 and 0.183

Fig. 4 Five partitions by the ranking tree with the average rank loss as 1.176, 0.474, 0.444, 0.242 and 0.173

5.2 Ranking with collaborative filtering

We also used the collaborative filtering data sets in [11] to perform experiments under similar conditions, comparing the Ranking Tree with the perceptron-based ranking algorithms and the Classification Tree. The experimental data comprise two collaborative filtering data sets: the Cystic Fibrosis [19] and MovieLens data sets. The Cystic Fibrosis data set consists of

items, where each entry is given as a query-document-rating triple, and the MovieLens data set of user-item-rating triples. We randomly chose a target ranking yt on an item, and then used the remaining ratings as the dimensions of an instance vector xt. The detailed experimental setup for each data set is described below.

The Cystic Fibrosis data set is a set of 100 queries with relevant documents. There are 1,239 documents, published from 1974 to 1979, describing aspects of Cystic Fibrosis. Each query-document pair has three ratings of highly relevant, marginally relevant and not relevant, which are valued as 3, 2, 1, respectively, in our experiments. For each pair, four ratings are assigned. Therefore, a feature vector has three dimensions and a target ranking. The sizes of the training set and test set are 4,247 and 572, respectively.

As the EachMovie data set used in [11] is not available, we took the MovieLens data from the GroupLens Research Project (http://www.grouplens.org/data/). The data set consists of 100,000 ratings (1-5) from 943 users on 1,682 movies, with each user rating at least 20 movies. Experiments similar to [11] were performed, where only those people (54 persons) who rated over 300 movies were considered. Considering a specific person as the reference for validating ranking results, a movie instance vector has 53 dimensions; the feature vector of a movie is constructed from the ratings of the other 53 persons. Given a person, we searched for the first 300 movies rated by her/him, and formed instances by using her/his ratings as target rankings. If one of the other persons has not seen the selected movie, a value '3' is assigned as her/his rating. In our experiments, 210 random movies form the training set and the other 90 movies act as the test set.

We applied the Ranking Tree and the Classification Tree to the two collaborative filtering data sets, and compared the experimental results with those reported in [11]. The results were averaged over 500 Monte-Carlo trials, as given in Table 2. The lowest values among the results are boldfaced, and the results are reported with their corresponding 95% confidence intervals under the Student's t distribution. We can see that the Ranking Tree significantly outperforms the other two algorithms (the Classification Tree and the perceptron-based ranking algorithms) on the Cystic Fibrosis data set. The perceptron-based algorithms perform relatively worse than the algorithms based on decision trees.

Table 2 The average rank loss produced by different algorithms on two collaborative filtering data sets

Algorithm                  Cystic Fibrosis            MovieLens
RT                         0.27 ± 0.00 (depth = 6)    0.79 ± 0.02 (depth = 2)
CT                         0.39 ± 0.00 (depth = 4)    0.80 ± 0.02 (depth = 1)
OAP-VP with τ = 0.1        0.47 ± 0.01                0.95 ± 0.03
OAP-VP with τ = 0.2        0.45 ± 0.01                0.95 ± 0.03
OAP-VP with τ = 0.3        0.45 ± 0.01                0.95 ± 0.03
OAP-Bagg with τ = 0.1      0.50 ± 0.01                0.96 ± 0.03
OAP-Bagg with τ = 0.2      0.48 ± 0.01                0.95 ± 0.03
OAP-Bagg with τ = 0.3      0.47 ± 0.01                0.96 ± 0.03
OAP-BPM with τ = 0.1       0.40 ± 0.01                0.89 ± 0.03
OAP-BPM with τ = 0.2       0.39 ± 0.01                0.89 ± 0.03
OAP-BPM with τ = 0.3       0.40 ± 0.02                0.90 ± 0.03
Prank                      0.50 ± 0.02                1.06 ± 0.03
Prank-VP                   0.52 ± 0.01                1.13 ± 0.03
WH with η = 0.2            0.75 ± 0.02                2.19 ± 0.06
WH with η = 0.05           0.48 ± 0.02                1.92 ± 0.07
WH with η = 0.01           0.43 ± 0.02                2.25 ± 0.08
WH with η = 0.001          0.41 ± 0.02                2.74 ± 0.10
This might be for the same reason as in the synthetic data set: it is difficult to find a good kernel mapping in this problem context. On the MovieLens data, both the Classification Tree and the Ranking Tree prefer trees with fewer nodes. It turns out that stumps (trees of depth 1) perform rather well in the MovieLens case. This might imply that, given sufficient recommenders, it is easy to find a recommender whose interests and preferences are most similar to a given user's.

An important property of tree classifiers is natural feature selection. A decision tree uses only a small subset of the input features to make predictions. Thus decision tree algorithms are vastly different from kernel algorithms in terms of the model space. Our experiments also showed that decision tree models can approximate many realistic ranking problems much better than kernel algorithms. Interpretability, speed and accuracy are important in many realistic problem-solving applications such as ranking. Compared to the Classification Tree and the perceptron-based ranking algorithms, the Ranking Tree not only achieves relatively higher accuracy, but also runs faster and offers good interpretability.

As a derivative of decision trees, an inherent problem with the Ranking Tree is instability: any small change in the data set might result in a very different series of splittings. The major reason for the instability is the hierarchical nature of the splitting process: the effect of errors in early splittings is propagated down to the following splittings. We believe that ideas from bagging and boosting might be a good solution to the instability problem of the Ranking Tree.
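The bagging remedy suggested above can be sketched generically as follows. This is our illustrative sketch, not something prescribed by the paper: `fit` stands for any ranking tree learner taking a list of (x, y) pairs and returning a predictor, and aggregating the predicted ranks by their median (rather than, say, their mean) is our own choice, made because the median respects the ordinal scale.

```python
import random
import statistics

def bagged_ranker(fit, train, n_models=25, seed=0):
    """Bagging sketch for stabilizing a ranking tree: train `fit` on
    bootstrap resamples of `train` and aggregate predicted ranks by median."""
    rng = random.Random(seed)
    models = []
    for _ in range(n_models):
        # bootstrap sample: draw len(train) examples with replacement
        boot = [train[rng.randrange(len(train))] for _ in train]
        models.append(fit(boot))
    def predict(x):
        return int(statistics.median(m(x) for m in models))
    return predict
```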

6 Conclusions

In this paper, we presented a novel ranking approach based on decision trees, namely the Ranking Tree. With a new impurity measure, the Ranking Tree extends decision trees to deal with ranking problems. This paper also provided a theoretical basis for the Ranking Tree algorithm and experimentally validated its effectiveness. Our algorithms were applied to the data sets in Harrington's previous work [11]. Experimental results showed that the Ranking Tree's predictions are closer to the true rankings than those of other ranking algorithms such as the Classification Tree and the perceptron-based ranking algorithms. We also illustrated the difference between classification and ranking, which showed that the Ranking Tree is more robust than the Classification Tree.

At the current stage we mainly deal with totally ordered ratings. There are many ranking problems where the order structure is much more complex; e.g., an example set might have several subsets, each defined with a different order. It would be interesting to extend the Ranking Tree to deal with ranking problems in these application scenarios. We also intend to investigate the instability inherent in the Ranking Tree, resorting to ideas from bagging and boosting.

Acknowledgments This work was supported in part by the National Basic Research Program of China (2004CB318103), the National Science Foundation of China (60033020), the Overseas Outstanding Talent Research Program of the Chinese Academy of Sciences (06S3011S01), and the National Key Technology R&D Program (2006BAK31B03).

References

1. Buntine W, Niblett T (1992) A further comparison of splitting rules for decision-tree induction. Mach Learn 8:75–85
2. Burges C, Shaked T, Renshaw E et al (2005) Learning to rank using gradient descent. In: Proceedings of the 22nd international conference on machine learning (ICML-2005), Bonn, Germany, pp 89–96
3. Breiman L, Friedman J, Olshen RA et al (1984) Classification and regression trees. Wadsworth, Belmont
4. Chu W, Ghahramani Z (2005) Gaussian processes for ordinal regression. J Mach Learn Res 6:1019–1041
5. Crammer K, Singer Y (2002) Pranking with ranking. In: Advances in neural information processing systems 14. MIT Press, Cambridge, pp 641–647
6. Cohen WW, Schapire RE, Singer Y (1999) Learning to order things. J Artif Intell Res 10:243–270
7. Freund Y, Iyer R, Schapire RE et al (2003) An efficient boosting algorithm for combining preferences. J Mach Learn Res 4:933–969
8. Genest D, Chein M (2004) A content-search information retrieval process based on conceptual graphs. Knowl Inform Syst 8:292–309
9. Hastie T, Tibshirani R, Friedman J (2001) The elements of statistical learning: data mining, inference, and prediction. Springer, New York
10. Herbrich R, Graepel T, Obermayer K (2000) Large margin rank boundaries for ordinal regression. In: Advances in large margin classifiers. MIT Press, Cambridge, MA, pp 115–132
11. Harrington EF (2003) Online ranking/collaborative filtering using the perceptron algorithm. In: Proceedings of the twentieth international conference on machine learning (ICML-2003), Washington, DC
12. Li T, Zhu S-H, Ogihara M (2006) Using discriminant analysis for multi-class classification: an experimental investigation. Knowl Inform Syst 10:453–472
13. Mitchell TM (1997) Machine learning. McGraw-Hill, New York, pp 52–78
14. Quinlan JR (1986) Induction of decision trees. Mach Learn 1:81–106
15. Quinlan JR (1993) C4.5: programs for machine learning. Morgan Kaufmann
16. Rosset S, Perlich C, Zadrozny B (2007) Ranking-based evaluation of regression models. Knowl Inform Syst 12:331–353
17. Spink A, Greisdorf H, Bateman J (1998) From highly relevant to not relevant: examining different regions of relevance. Inform Process Manag 34:599–621
18. Shen L, Joshi AK (2005) Ranking and reranking with perceptron. Mach Learn 60:73–96
19. Shaw WM, Wood JB, Wood RE et al (1991) The cystic fibrosis database: content and research opportunities. LISR 13:347–366
20. Shashua A, Levin A (2003) Ranking with large margin principle: two approaches. In: Proceedings of the conference on neural information processing systems (NIPS) 14
21. Thabtah F-A, Cowling P, Peng Y-H (2005) Multiple labels associative classification. Knowl Inform Syst 9:109–129

Authors' biographies

Fen Xia is a Ph.D. candidate in the Key Lab of Complex System and Intelligent Science at the Institute of Automation, Chinese Academy of Sciences. He received his Bachelor's degree in Automation from the University of Science and Technology of China (USTC) in 2003. His research interests include statistical machine learning, ranking, regularization methods, efficient algorithms and image processing.


Wensheng Zhang received the Ph.D. degree in Pattern Recognition and Intelligent Systems from the Institute of Automation, Chinese Academy of Sciences (CAS), in 2000. He joined the Institute of Software, CAS, in 2001. He is a Professor of Machine Learning and Data Mining and the Director of the Research and Development Department, Institute of Automation, CAS. He has published over 32 papers in the area of Modeling Complex Systems, Statistical Machine Learning and Data Mining. His research interests include Intelligent Information Processing, Pattern Recognition, Artificial Intelligence and Computer Human Interaction.

Fuxin Li is a Ph.D. candidate in the Key Lab of Complex System and Intelligent Science at the Institute of Automation, Chinese Academy of Sciences. He received his Bachelor's degree in Computer Science and Technology from Zhejiang University, China in 2001. His research interests include metric learning, regularization methods, statistical machine learning, learning with existing knowledge, semi-supervised learning and bioinformatics.

Yanwu Yang received the Ph.D. degree in computer science from the doctoral school of the Ecole Nationale Superieure d'Arts et Metiers (ENSAM), France in 2006. He joined the Lab of Complex Systems and Intelligence Sciences in 2007. His current interests include user modeling, Human-Computer Interaction and text mining.
