C3: A New Learning Scheme to Improve Classification of Rare Category Emails

Jie Yang1, Joshua Zhexue Huang2, Ning Zhang1, and Zhuoqun Xu1

1 Department of Computer Science and Technology, Peking University, Beijing, 100871, P.R. China
{yangjie,nzhang}@ebusiness.pku.edu.cn, [email protected]
2 E-Business Technology Institute, University of Hong Kong, Pokfulam Road, Hong Kong
[email protected]
Abstract. This paper proposes C3, a new learning scheme to improve the classification performance on rare category emails in the early stage of incremental learning. C3 consists of three components: the chief-learner, the co-learners and the combiner. The chief-learner is an ordinary learning model with an incremental learning capability. The chief-learner performs well on categories trained with sufficient samples but badly on rare categories trained with insufficient samples. The co-learners, which are focused on the rare categories, are used to compensate for the weakness of the chief-learner in classifying new samples of the rare categories. The combiner combines the outputs of both the chief-learner and the co-learners to make a final classification. The chief-learner is updated incrementally with all the new samples over time, while the co-learners are updated with new samples from the rare categories only. After the chief-learner has gained sufficient knowledge about the rare categories, the co-learners become unnecessary and are removed. Experiments on customer emails from an e-commerce company have shown that the C3 model outperformed the Naive Bayes model in classifying emails of rare categories in the early stage of incremental learning.
1 Introduction
Many applications of text classification systems follow this scenario. Given a set of samples collected before time t0, first classify the samples into m categories or classes (usually manually) according to some defined business criteria. Then divide the samples into a training set and a test set, use a classification algorithm to build a model from the training set, and test it with the test samples. After that, install the model to classify new samples arriving after time t0 and verify the classification results through other means. Finally, update the model with the new samples. A typical example is the automatic classification and routing of customer emails adopted in many companies. In these systems,
Supported by the National Natural Science Foundation of China (No. 60003005) and IBM.
T.D. Gedeon and L.C.C. Fung (Eds.): AI 2003, LNAI 2903, pp. 747-758, 2003. © Springer-Verlag Berlin Heidelberg 2003
a model is first built from selected customer emails and then applied to classify new customer emails in real time and route them to the corresponding people to handle. The people who process these emails identify the misclassified emails, correct the classification results and save the results in the email database. These identified emails are later used to update the model. These real world applications require the learning model to possess the following capabilities:

1. Incremental learning. Since the misclassified emails are continuously fed into the email database, the model must be able to incrementally learn the new knowledge embedded in these emails to avoid future mistakes.
2. Adaptability to new categories. Some new emails may not fall into any of the existing categories in the model. The knowledge of the new categories should be added incrementally to the model without rebuilding it.
3. Learning quickly from few examples. In many application domains, the distribution of samples across categories is uneven. Some categories have far fewer samples than others, and it is difficult for the learning model to learn the characteristics of these categories from insufficient samples.

A few learning algorithms [2], such as Naive Bayes, neural networks and support vector machines, possess the incremental learning capability. Adding new categories to a model is straightforward for the Naive Bayes algorithm and achievable in neural networks and support vector machines. The challenge to a learning algorithm is the speed of incremental learning. If new samples arrive slowly, most learning algorithms take a long time to learn sufficient knowledge of the rare categories, because their incremental learning is done either through modification of the model architecture, such as a neural network architecture [6], or through readjustment of model parameters during the learning process, such as the weights of a neural network or the probabilities of a Naive Bayes model [4][5][7].
When classifying new samples of rare categories, the model usually performs badly until sufficient knowledge is learnt about the categories. In this paper, we propose a new learning scheme, called C3, that uses three learning components to improve the learning performance on rare categories. The first component, the chief-learner, is an ordinary learning model with the incremental learning capability. In the initial model building stage the chief-learner can learn sufficient knowledge of the categories with enough training samples and performs well in classifying new samples of these categories. However, it performs badly on some rare categories because it is insufficiently trained on them. As time goes by and more new samples are fed to the model incrementally, the knowledge about the rare categories augments and the classification accuracy on new samples of the rare categories increases. To compensate for the weakness of the chief-learner in classifying samples of rare categories in the early stage, we use a group of co-learners to quickly gain the knowledge of these categories. The co-learners contain only the knowledge of rare categories and are therefore more sensitive to new samples of rare categories. The outputs of the chief-learner and the co-learners are combined in the
combiner in such a way that if the received new sample is of a rare category, the output of one co-learner will boost its chance of being classified correctly. The new samples of rare categories are fed to both the chief-learner and the co-learners to incrementally update the models. The more knowledge about the rare categories is learnt by the chief-learner, the less the co-learners contribute to the classification. After the chief-learner has accumulated enough knowledge about the rare categories and is able to classify their samples independently, the co-learners become unnecessary and are removed. The age of a co-learner depends on the learning speed of the chief-learner on that category. We have implemented the C3 scheme using the Naive Bayes model as the chief-learner and an instance-based learning technique as the co-learners. The combiner is designed as (1 - f(t))P_Cf + f(t)P_Co, where P_Cf and P_Co are the outputs of the chief-learner and the co-learners respectively and f(t) -> 0 as t -> infinity. Experiments on the classification of customer emails from an e-commerce company have shown that the new learning scheme raises the classification accuracy on rare email categories in the early stage of the model. The remainder of this paper is structured as follows. In Section 2, we introduce the framework of C3 and explain its incremental learning process and combination method. The implementation is addressed in Section 3. Section 4 presents the experimental results. Finally, we give a brief conclusion and point out our future work in Section 5.
2 C3 Scheme

2.1 C3 Framework
The C3 scheme is a variant of the ensemble learning algorithms [8] that combine the outputs of multiple base learners to make the final classification decision. Unlike most ensemble algorithms, which treat their base classifiers uniformly, C3 employs two kinds of base classifiers aimed at improving the classification performance in the early stage of incremental learning. The learning system of C3 consists of three components: a chief-learner, a set of co-learners and a combiner. Figure 1 shows the relationships of these three components. The chief-learner Cf(t) is an ordinary classifier that has an incremental learning capability. Let D(t0) be a training data set collected at time t0 and classified into m categories. Cf(t0) is a classifier built from D(t0) with a learning algorithm. Due to the uneven distribution of categories in the training data set, Cf(t0) may not perform well on some rare categories whose samples are not sufficient in D(t0). The co-learners Co_i(t0) are built to compensate for the weakness of Cf(t0) in classifying new samples of the rare categories. Each Co_i(t0) is built only from the samples of the rare category c_i. It tends to produce a strong output when it classifies a new sample of category c_i and a weak output if the sample is in another category. The combiner Cm(t) combines the outputs from Cf(t) and Co_i(t) to determine the final classification of a new sample. If a new sample is in a rare
Fig. 1. The framework of the C3 scheme. The combiner Cm(t) combines the outputs from the chief-learner Cf(t) and the co-learners Co_i(t)

category c_i, Cf(t) tends to produce a weak output while the co-learner Co_i(t) will generate a strong output. Let P^i_Cf(t) and P_Co_i(t) be the outputs of Cf(t) and Co_i(t), respectively, given an input sample d from category c_i. The output of the combiner Cm(t) is calculated as

c* = argmax_{i in C} ((1 - f_i(t)) P^i_Cf(t) + f_i(t) P_Co_i(t))    (1)
where C is the set of all categories, f_i(t) is a decay function with f_i(t) -> 0 as t -> infinity, and f_i(t) = 0 if c_i is not a rare category. From (1) we can see that if a new sample d belongs to a category c_j that is not rare, P^j_Cf for category c_j will be large while P_Cf for the other categories will be small, and all P_Co will be small. Therefore, the final classification will be determined only by P^j_Cf. If the new sample is in a rare category c_i, the chief-learner output P^i_Cf may not be greater than those of the other categories and the sample is likely to be misclassified by the chief-learner. However, the co-learner for category c_i will generate a large P_Co_i that compensates for the weakness of the chief-learner through the combiner (1), and a correct classification can be obtained. As more examples of a rare category c_i are learnt by the chief-learner Cf(t) over time, P^i_Cf(t) for category c_i increases. Because the decay function f_i(t) decreases as more samples of category c_i are received, the effect of P_Co_i(t) also decreases to avoid bias towards category c_i. At time t_n, when sufficient knowledge about category c_i has been learnt by the chief-learner, f_i(t_n) -> 0. The co-learner for category c_i then becomes unnecessary and is removed. This process indicates that the co-learners are dynamic. When a sample of a new category is presented to the system, a co-learner for that category is created. When enough knowledge of a rare category has been learnt by the chief-learner, the corresponding co-learner is removed. The life cycles of the co-learners are determined by the decay functions f_i(t).
2.2 Incremental Learning Process
The incremental learning process of a C3 learning system is illustrated in Figure 2. At the initial stage, time t0, a set of training samples from a learning domain
Fig. 2. The incremental learning process of C3. The new category c4 is added at time t1. At time tn, the chief-learner is well trained on this category and the co-learner Co4 is removed

is collected. Assume the samples are classified into 3 categories {c1, c2, c3} and all categories have enough samples. The initial model of the chief-learner is built from the initial samples and no co-learner is created. At time t1, a few samples of a new category c4 are identified. The chief-learner model built at time t0 is unable to classify them. The knowledge of category c4 is learnt incrementally by the chief-learner, and a co-learner for this category is created from the few samples. The updated learning system now contains partial knowledge about category c4 and can therefore partially process new incoming samples of that category. At time t2, more c4 samples have been presented to the system, some classified correctly and some wrongly. These samples are fed to the learning system and the chief-learner learns more complete knowledge about c4, becoming able to classify new incoming samples with greater confidence. At time tn, enough c4 samples have been learnt by the chief-learner, which has gained sufficient knowledge about c4 through the incremental learning process and is able to classify new incoming c4 samples correctly. The co-learner for category c4 has become unnecessary and is therefore removed from the learning system. The above process describes the life cycle of one co-learner. In a real dynamic learning process, several rare categories can appear in the same or different time periods. At any time ti, old co-learners can be removed and new co-learners created. The life cycles of different co-learners also differ, depending on the arrival rate of the new samples of the rare categories. The C3 learning system adapts to this dynamic process.
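The create-and-retire life cycle described above can be sketched in code. This is an illustrative sketch rather than the authors' implementation: the function name and the fixed retire_age cut-off are our assumptions (in the paper, a co-learner is retired when the decay function f_i(t) approaches zero, i.e. once the chief-learner classifies the category well on its own).

```python
def observe_sample(colearners, ages, category, retire_age=40):
    """Bookkeeping for one incoming sample of `category`.

    colearners: dict mapping each active rare category to its co-learner
    ages: dict counting samples seen per rare category
    retire_age: hypothetical cut-off after which the chief-learner is
    assumed to have learnt enough and the co-learner is removed
    """
    if category not in ages:
        # First sample of a new category: create a co-learner for it.
        colearners[category] = object()  # placeholder for a real model
    ages[category] = ages.get(category, 0) + 1
    if ages[category] >= retire_age:
        # Chief-learner assumed sufficiently trained: retire co-learner.
        colearners.pop(category, None)
```

Calling this once per verified sample reproduces the state transitions of Figure 2: a co-learner appears at time t1 and disappears at time tn.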
One important part of the dynamic learning process is the identification of samples of rare categories and the feeding of correctly labelled samples to the learning system. In real applications, this part is often performed by humans.
3 An Implementation
In this section, we present an implementation of the C3 scheme that uses the Naive Bayes model as the chief-learner and the instance-based techniques as the co-learners. We use the learning curve generated from the training samples to determine the combiner.
3.1 The Naive Bayes Chief-Learner
We choose the Naive Bayes model [1] as the chief-learner in our C3 implementation for its simplicity, its capability for incremental learning and its performance in text document classification, which is our application domain. In the Bayesian learning framework, it is assumed that the text data is randomly generated by a parametric model (parameterized by θ). The estimates of the parameters, θ̂, can be calculated from the training data. The estimated parameters can then be used to predict the class of new documents by calculating the posterior probability of the new documents belonging to the existing classes in the model. Let D be the domain of documents with |C| categories and p(d_i|θ) a model that randomly generates a document d_i ∈ D according to

p(d_i|θ) = Σ_{j=1}^{|C|} p(c_j|θ) p(d_i|c_j; θ)    (2)
where p(c_j|θ) is the probability of category c_j and p(d_i|c_j; θ) is the probability of document d_i given category c_j. In this paper we assume that p(d_i|c_j; θ) is a multinomial model [1]. Let W be the word vocabulary; document d_i is represented as a vector <w_{d_i,1}, w_{d_i,2}, ..., w_{d_i,|d_i|}>, where w_{d_i,k} ∈ W is the kth word in document d_i. With the Naive Bayes assumption of independence between the word variables, the probability of document d_i in class c_j is given by

p(d_i|c_j; θ) = p(|d_i|) Π_{k=1}^{|d_i|} p(w_{d_i,k}|c_j; θ)    (3)
where θ = (θ_j, θ_{j,t}), θ_j ≡ p(c_j|θ), θ_{j,t} ≡ p(w_t|c_j; θ) and Σ_{t=1}^{|W|} θ_{j,t} = 1. Let N(w_t, d_i) be the count of word w_t in document d_i, and define P(c_j|d_i) ∈ {0, 1} as given by the document's class label. Given the training document set D, θ_{j,t} can be estimated by

θ̂_{j,t} = (1 + Σ_{i=1}^{|D|} N(w_t, d_i) P(c_j|d_i)) / (|W| + Σ_{s=1}^{|W|} Σ_{i=1}^{|D|} N(w_s, d_i) P(c_j|d_i))    (4)
and θ_j is estimated by

θ̂_j = Σ_{i=1}^{|D|} P(c_j|d_i) / |D|    (5)
Given θ̂_j and θ̂_{j,t}, the posterior probability of document d_i belonging to class c_j can be calculated by

p(c_j|d_i; θ̂) = p(c_j|θ̂) Π_{k=1}^{|d_i|} p(w_{d_i,k}|c_j; θ̂) / Σ_{r=1}^{|C|} p(c_r|θ̂) Π_{k=1}^{|d_i|} p(w_{d_i,k}|c_r; θ̂)    (6)
Incremental learning is performed by re-calculating the new θ̂ based on the previous counts N(w_t, d_i), P(c_j|d_i) and the new training sample. To reduce the vocabulary size and increase the stability of the model, feature selection is conducted by selecting the words that have the highest mutual information with the class variable.
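As a concrete illustration, a minimal incremental multinomial Naive Bayes in the spirit of equations (4)-(6) can be sketched as follows. The class and method names are ours, not the paper's; log-probabilities are used for numerical stability, and feature selection is omitted.

```python
from collections import defaultdict
from math import log

class IncrementalNaiveBayes:
    """Multinomial Naive Bayes with Laplace smoothing, updated
    incrementally by accumulating per-class word counts (eqs. (4)-(5))."""

    def __init__(self):
        self.word_counts = defaultdict(lambda: defaultdict(int))  # class -> word -> count
        self.class_docs = defaultdict(int)  # class -> number of documents
        self.total_docs = 0
        self.vocab = set()

    def update(self, words, label):
        # One labelled document; counts simply accumulate, so calling
        # this repeatedly realises the incremental learning step.
        self.class_docs[label] += 1
        self.total_docs += 1
        for w in words:
            self.word_counts[label][w] += 1
            self.vocab.add(w)

    def log_posterior(self, words, label):
        # log p(c_j) + sum_k log p(w_k | c_j), Laplace-smoothed as in (4)
        total = sum(self.word_counts[label].values())
        lp = log(self.class_docs[label] / self.total_docs)
        for w in words:
            lp += log((1 + self.word_counts[label][w]) / (len(self.vocab) + total))
        return lp

    def classify(self, words):
        return max(self.class_docs, key=lambda c: self.log_posterior(words, c))
```

Calling update() on each newly verified email realises the incremental step: the counts in (4) and (5) accumulate, so no retraining from scratch is needed.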
3.2 Instance-Based Co-Learners
A co-learner for a rare category c_j is built from |D_j| (≥ 1) samples, where D_j is the set of samples from category c_j. Let d_i = <w_{i,1}, w_{i,2}, ..., w_{i,|W|}> be a document in D_j, where w_{i,k} is the frequency of word w_k ∈ W in document d_i and 1 ≤ i ≤ |D_j|. The co-learner for category c_j is modeled by the center of the training samples

<v_{j,1}, v_{j,2}, ..., v_{j,|W|}>    (7)

where

v_{j,k} = Σ_{i=1}^{|D_j|} w_{i,k} / |D_j|    (8)
Given a new document d = <w_1, w_2, ..., w_{|W|}>, we first calculate the Euclidean distance between the document and the center as

dv_j = sqrt(Σ_{k=1}^{|W|} (w_k - v_{j,k})²)    (9)
We then map dv_j to a probability under the normal distribution as

g(dv_j) = e^{-(dv_j - µ)² / 2σ²} / (σ sqrt(2π))    (10)
where µ = 5.5 and σ = 1 in our settings.
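A co-learner of this kind reduces to a centroid plus a distance-to-probability mapping. The following sketch (function names are ours, with the µ = 5.5, σ = 1 setting above as defaults) illustrates equations (8)-(10):

```python
import math

def build_colearner(docs):
    """Center of the rare-category samples, eq. (8); each doc is a
    list of word frequencies over the vocabulary W."""
    n = len(docs)
    return [sum(doc[k] for doc in docs) / n for k in range(len(docs[0]))]

def colearner_output(center, doc, mu=5.5, sigma=1.0):
    """Euclidean distance to the center (eq. (9)) mapped through a
    normal density (eq. (10))."""
    dist = math.sqrt(sum((w - v) ** 2 for w, v in zip(doc, center)))
    return math.exp(-((dist - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))
```

Note that the output peaks when the distance to the centroid equals µ, which is the density-mapping design choice the paper adopts rather than a monotone decrease with distance.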
3.3 The Combiner
We use a decay function f(t) to balance the influence of the chief-learner and the co-learners in the combiner when making the final classification. In learning the knowledge of a new category, a co-learner learns much faster than the chief-learner. Figure 3 shows the learning curves of the chief-learner and a co-learner. The learning speed of the co-learner Co(t) is faster than that of the chief-learner Cf(t) in the early stage of the model, and the co-learner is more accurate in classifying samples of a rare category. Therefore, more weight should be put on the co-learner. As time goes by and more knowledge about the rare category is learnt by the chief-learner, its classification accuracy increases. The weight on the co-learner then decreases in order to avoid bias towards this category, which would result in overfitting. In this work we define the decay function as

f_j(t) = 1 / age_j(t)^{1/4},    t ≥ 1    (11)
Fig. 3. Learning curves of the chief-learner Cf(t) and the co-learner Co(t) (accuracy against time t, measured in the number of training samples)
where age_j(t) is defined as the total number of samples of rare category c_j fed to C3 at time t. Thus the combiner of (1) is implemented as

c* = argmax_{i in C} ((1 - 1/age_i(t)^{1/4}) P^i_Cf(t) + (1/age_i(t)^{1/4}) P_Co_i(t))    (12)
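A sketch of this combiner follows. The function names are ours, and the scores are assumed to be the normalised outputs of the two learners:

```python
def decay(age):
    # f_j(t) = 1 / age_j(t)**(1/4), eq. (11), with age >= 1
    return 1.0 / age ** 0.25

def combine(p_chief, p_co, ages):
    """Eq. (12): pick the category maximising the weighted mix of
    chief-learner and co-learner outputs. `ages` holds sample counts
    for rare categories only; for other categories f_i = 0."""
    def score(c):
        f = decay(ages[c]) if c in ages else 0.0
        return (1.0 - f) * p_chief.get(c, 0.0) + f * p_co.get(c, 0.0)
    return max(p_chief, key=score)
```

With one rare-category sample the decay weight is 1 and the co-learner dominates; as the age grows, the weight shifts smoothly back to the chief-learner.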
4 Experimental Results
The implementation of the C3 scheme was tested with customer email data collected from a small e-commerce company in Hong Kong that sells products to global customers through emails. The company receives about 100 customer emails daily that order products, pay with credit cards, query about products, prices and services, and return products. The emails are written in different languages. Currently, an email categorization system is employed to classify incoming emails and route them to the corresponding people to process. All classification results are saved in an email database. If an email has been classified wrongly, the person who processes it reclassifies it and saves the right result in the email database for model update.
4.1 E-mail Data Sets
We used 971 customer emails received over 10 consecutive days in September 2002. These emails were manually classified into 17 categories according to some business criteria. In this experiment we discarded the junk emails and selected only the 8 categories with more than 20 emails. We divided these emails into three groups. Group 1 contains the categories "Discount Question", "Item Question" and "Order Address Amend". Group 2 contains the categories "Credit Card Refused", "Over Charge", "Product Inquiry" and "Order Amend". Group 3 contains the categories "Order Status Question", "Over Charge" and "Product Inquiry". Two categories appeared in two groups without any particular reason. We divided the emails in each group into a training set and a test set. For the training set we chose one category as the rare category and assumed that the emails in this category arrived on separate days. The emails in the other categories were collected before the date when the model was built. The settings of the training emails are
Table 1. Training sets. Categories marked with * were the rare categories that arrived in separate days

Group 1
Category               day0 day1 day2 day3 day4 day5 day6 day7 day8
Discount Question       14
Item Question           13
Order Address Amend*          1    6   11   16   21   26   31   36

Group 2
Category               day0 day1 day2 day3 day4 day5 day6 day7 day8
Credit Card Refused     17
Over Charge             31
Product Inquiry         60
Order Amend*                  1    6   11   16   21   26   31   36

Group 3
Category               day0 day1 day2 day3 day4 day5 day6 day7 day8 day9
Order Status Question  149
Over Charge             43
Product Inquiry*              1   11   21   31   41   51   61   71   81
Table 2. Test sets

Group 1
Category               sample no.
Discount Question      13
Item Question          11
Order Address Amend    18

Group 2
Category               sample no.
Credit Card Refused    15
Over Charge            30
Product Inquiry        15
Order Amend            14

Group 3
Category               sample no.
Order Status Question  63
Over Charge            18
Product Inquiry        23
shown in Table 1. We used one test data set for each group due to the lack of sample emails. The same test set was used to test all updated models obtained on different dates. The number of samples for each category is given in Table 2.
Fig. 4. Experimental results on Group 1. The left side shows F1 obtained from all the test samples in Group 1 and the right side shows that of the rare (Order Address Amend) category
4.2 Model Building and Incremental Learning
We first used the training emails of Day 0 to build a Naive Bayes model and tested it with the test emails, excluding the rare category. Then, we updated the model with the emails of the rare category on Day 1 and tested the updated model with all test emails in a group. Since a new category was added on Day 1, we created a co-learner for the rare category to build a C3 model. We then tested the C3 model with the test emails. We continued this process for all consecutive days and evaluated the classification performance of each updated model with the test emails. The classification performance was measured by

F_β = ((β² + 1) × p × r) / (β² × p + r)    (13)
where p and r are the precision and recall [3] calculated from the test data set. In our approach, we regard precision and recall as equally important, so we set β = 1. Since precision and recall are not isolated from each other, this measure avoids bias towards either.
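The measure of (13) in code form (a trivial helper, with names of our choosing; the zero-division guard is our addition for the degenerate case):

```python
def f_beta(p, r, beta=1.0):
    """F_beta of eq. (13); beta = 1 weighs precision p and recall r equally."""
    if p == 0.0 and r == 0.0:
        return 0.0  # degenerate case: no true positives at all
    return (beta ** 2 + 1) * p * r / (beta ** 2 * p + r)
```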
4.3 Result Analysis
Figure 4 shows the experimental results on the Group 1 data. The horizontal axis represents the number of rare category emails cumulatively fed to the model on consecutive days. Each point on the curves represents a particular date when the model was updated incrementally and tested. The solid curve is the result of the Naive Bayes model and the dotted curve is the result of the C3 model. Each curve is the average of three independent experiments on the Group 1 data. From the left side of Figure 4, one can see that the chief-learner Naive Bayes model performed very well on Day 0, when no email of the rare category was
Fig. 5. Experimental results on Group 2 (top) and Group 3 (bottom)
present. On Day 1, when one rare category email had been learnt, the performance of the Naive Bayes model dropped dramatically, while the classification accuracy of the C3 model could still reach more than 60% even though only one email was present in the co-learner. The effectiveness of the co-learner in the early stage of learning a rare category is obvious. On Day 2, when 5 emails of the rare category had been learnt by the models, the performance of the Naive Bayes model improved a little but the performance of the C3 model decreased slightly due to disturbance in the data. On Day 3, 5 more emails of the rare category were learnt. Both the Naive Bayes model and the C3 model showed a significant increase in classification performance because more knowledge had been accumulated in both models. However, the C3 model performed better than the Naive Bayes model. On Day 6, 26 emails of the rare category had been learnt. The two models achieved their best performance because sufficient knowledge of that category had been accumulated. After Day 6, the performance of the Naive Bayes model remained stable, whereas the performance of the C3 model dropped slightly. This is due to overfitting of the C3 model caused by the co-learner. When this point is reached, the co-learner should stop functioning because the Naive Bayes model can perform the classification alone. The right side of Figure 4 shows the result on the rare category emails only. The advantage of the C3 model in classifying the rare category emails in the early learning stage is quite clear. Figure 5 shows the results on the Group 2 and Group 3 data sets. Similar trends can be observed, which demonstrates that the co-learner was effective in enhancing the classification accuracy of the rare category in the early stage of the
model. However, the parameters of the co-learner and the combiner have to be well adjusted. In this experiment, the parameters of the normal distribution distance scaling in the co-learner were set to µ = 5.5 and σ = 1, and the decay function of the co-learner was f_i(t) = 1 / (age_i(t) + 0.2)^{1/4}.
5 Conclusions
In this paper we have described C3, a new learning scheme that uses co-learners to boost the classification performance of a classification model on rare categories in the early stage of incremental learning. We have discussed an implementation of the C3 scheme with the Naive Bayes model as the chief-learner and instance-based learning techniques as the co-learners. Our preliminary experiments on the classification of a limited set of customer emails have demonstrated that the co-learners can improve the classification performance on rare category emails in the early learning stage of the model. Although the initial results are encouraging, more experiments on diverse data from different domains are necessary to further test the effectiveness of the new scheme. We will continue experimenting with more customer emails in more categories and further study the parameters of the co-learners and the combiner. Furthermore, we will study other implementations with different learning models such as support vector machines and k-nearest neighbors.
References

[1] A. McCallum and K. Nigam, A Comparison of Event Models for Naive Bayes Text Classification, AAAI-98 Workshop on Learning for Text Categorization (1998).
[2] C. Giraud-Carrier, A Note on the Utility of Incremental Learning, AI Communications 13(4) (2000), 215-223.
[3] F. Sebastiani, Machine Learning in Automated Text Categorization, ACM Computing Surveys 34(1) (2002), 1-47.
[4] J. D. M. Rennie, ifile: An Application of Machine Learning to E-Mail Filtering, Proceedings of the KDD-2000 Workshop on Text Mining (2000).
[5] R. B. Segal and J. O. Kephart, Incremental Learning in SwiftFile, Proceedings of the 17th International Conference on Machine Learning (2000), 863-870.
[6] S. E. Fahlman and C. Lebiere, The Cascade-Correlation Learning Architecture, Advances in Neural Information Processing Systems 2 (1990), 524-532.
[7] S. K. Chalup, Incremental Learning in Biological and Machine Learning Systems, International Journal of Neural Systems 12(6) (2002), 447-465.
[8] T. G. Dietterich, Machine-Learning Research: Four Current Directions, AI Magazine 18(4) (1997), 97-136.