Combining Local Feature Scoring Methods for Text Categorization

Nayer M. Wanas, Dina A. Said, Nadia I. Hegazy, and Nevin M. Darwish

Pattern Recognition and Information Systems Group, Informatics Department, Electronics Research Institute, Cairo, Egypt
Department of Computer Engineering, Faculty of Engineering, Cairo University, Cairo, Egypt
[email protected], [email protected], [email protected], [email protected]

Abstract

Dimensionality reduction is an important process in text categorization. Feature scoring methods are used to realize this reduction: features are evaluated, and selection is performed according to a certain threshold. In this paper, we propose combining pairs of high-performing feature scoring methods to enhance text categorization. We analyzed the performance of this combination using three operators: the union operator (UN), the union-cut operator (UC), and the intersection operator (INT), the last of which increases the confidence in the selected features. The results suggest that these combining operators achieve an improvement when applied to feature selection methods of comparable performance. Generally, the UC operator demonstrated the best performance in classifying frequent categories, whereas the UN operator was effective in the classification of rare categories. Additionally, the INT operator showed some potential in terms of storage reduction and performance improvement.

1 Introduction

Text Categorization (TC) is the process of assigning one or more labels to a given text. This process is considered a supervised classification task, since a collection of labeled (pre-classified) documents is provided; the goal is to assign a label to a newly encountered, yet unlabeled, pattern [7, 23]. TC plays an important role in a wide variety of applications such as spam filtering [31], news recommendation [3], word sense disambiguation [19], email classification [24], topic detection and tracking [26], webpage classification [29], and topical crawling [20].

Text categorization can be divided into four processes, namely document pre-processing, dimensionality reduction, feature weighting, and classification. In the document pre-processing phase, punctuation and common words such as 'a', 'the', etc. are eliminated. Words are then extracted from the document, stemmed, and presented in the form of a bag-of-words (BOW) [2]. Dimensionality reduction is achieved by selecting the features that have higher importance to the classification process. While term extraction is possible, term selection is the more popular approach [18, 23]. Term selection can be done by filtering the features according to some computed score [28]. The selected features are then weighted in the feature weighting phase. This is commonly performed using the term frequency-inverse document frequency (tf-idf) technique [23], where tf measures the importance of the term within the document and idf measures the general importance of the term [22]. Finally, classification is performed through supervised learning: the selected features are used to learn how to distinguish among the different document categories. Several classifiers have been used in TC, among them K-nearest neighbors [25], neural networks [16], linear classifiers [15], decision rules [1], maximum entropy [17], and Support Vector Machines (SVM) [12]. Studies have shown that SVM is among the best performing classifiers in TC applications [6, 8, 27].

One of the most important challenges facing research in TC is the reduction of the high dimensionality of document features. This reduction decreases the computational resources, storage, and memory required to manage these features [18]. Term selection is performed by evaluating scoring methods locally on each category in the training set. Thresholding is then performed to select the features that have the highest scores [6]. Thresholding is done according to either of two policies: (a) a local policy, or (b) a global policy. In the local policy, thresholding is performed locally on each category, and the final representative feature set is composed from the per-category selections [7].
In the global policy, on the other hand, a globalization schema is applied to extract a single global score for each feature, commonly by maximizing or weightily averaging the local feature scores over the categories [28]. Thresholding is then applied to these global scores to compose the final feature set that represents the whole training set.

The decision to commit to certain features affects not only the computational and storage requirements, but also the quality of the classification. The choice of features is usually made through an empirical study, where one group of features is selected to represent a certain document set. However, the accuracy of classification can reach a saturation point based on the features selected. Reducing the effect of the initial selection or conditions for classification has been the premise of work in the area of multiple classifier systems. This follows logically from the way a human consults different approaches and conditions to reach a final decision. The aggregation can be performed at different levels: the decision level, the feature selection level, or the data level [5]. In this work, we are concerned with aggregation at the feature selection level.

Rogati and Yang [21] examined the performance of merging some feature scoring methods globally after applying either maximization or averaging. Their combining approach was based on normalizing the feature scores of each word and taking the maximum; thresholding was then performed on the combined list. They concluded that using χ2 combined with either DF (Document Frequency) or IG (Information Gain) leads to improved performance. Combining feature scores using this approach requires the normalization of both lists. However, since the feature scoring methods are different in nature, normalization can be challenging.

The following sections examine the performance of combining lists of features formed by feature selection methods applied locally on each category. These selection methods are DF, IG, MI (Mutual Information), and CC (Correlation Coefficient). The combined list is formed using different operators: the union, the union-cut, or the intersection of feature pairs.
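As a concrete illustration of the feature weighting step described above, the following minimal sketch computes tf-idf weights from already-tokenized documents (the function name and the unsmoothed idf variant are choices of this summary, not prescribed by the paper):

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per
    document, using tf * log(N / df) as described in the text."""
    n = len(docs)
    # df counts, per term, the number of documents containing it
    df = Counter(term for doc in docs for term in set(doc))
    weighted = []
    for doc in docs:
        tf = Counter(doc)
        weighted.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return weighted
```

Note that a term occurring in every document receives weight zero, reflecting that idf discounts terms with no discriminative value.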

2 Feature Selection Methods

Several feature selection measures have been explored in the literature. In this study we focus on Document Frequency (DF), Information Gain (IG), Mutual Information (MI), and Correlation Coefficient (CC).

• Document Frequency (DF) is simply the number of documents in which a certain word occurs [28].

• Information Gain (IG) is the number of bits gained for a category by knowing the presence or absence of a word in a document [23]. IG is defined as

IG(w_k, c_i) = Σ_{c ∈ {c_i, c̄_i}} Σ_{w ∈ {w_k, w̄_k}} p(w|c) log [ p(w|c) / (p(w) p(c)) ],   (1)

where w_k is a single word in a category c_i.

• Mutual Information (MI) captures the idea that the words most relevant to a category are those that occur only in documents belonging to the category of interest. Following [2], MI is estimated as

MI(w_k, c_i) = log [ (A × N) / ((A + C)(A + B)) ],   (2)

where N is the number of documents in the training set, A is the number of times word w_k and category c_i co-occur, B is the number of times w_k occurs without c_i, and C is the number of times c_i occurs without w_k.

• Correlation Coefficient (CC) is the square root of χ2, which is defined as [28]:

χ2(w_k, c_i) = [ N × (AD − CB)^2 ] / [ (A + C)(A + B)(D + C)(D + B) ],   (3)

where D is the number of times neither c_i nor w_k occurs. χ2 measures the lack of independence between a word w and a category c, and can be considered a normalized form of MI [28]. However, it suffers from equating the probabilities of negative correlation (CB) and positive correlation (AD) [16]. CC overcomes this problem by taking the square root, and has been shown to outperform χ2 [16]. CC is defined as:

CC(w_k, c_i) = [ √N × (AD − CB) ] / √[ (A + C)(A + B)(D + C)(D + B) ].   (4)
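The count-based scores above can be computed directly from the contingency counts A, B, C, D for a (word, category) pair; DF itself is just A + B. A minimal sketch (the function names are ours, not the paper's):

```python
import math

def mi_score(A, B, C, N):
    """Mutual information, Eq. (2): log(A*N / ((A+C)(A+B))).
    Assumes A > 0, i.e. the word co-occurs with the category."""
    return math.log((A * N) / ((A + C) * (A + B)))

def cc_score(A, B, C, D):
    """Correlation coefficient, Eq. (4): the signed square root of
    chi-square, so negatively correlated words score negatively
    instead of as high as positively correlated ones."""
    N = A + B + C + D
    return (math.sqrt(N) * (A * D - C * B)) / math.sqrt(
        (A + C) * (A + B) * (D + C) * (D + B))
```

A word confined to the category of interest gets a high MI and a positive CC; a word that mostly occurs outside the category gets a negative CC, which chi-square alone would not distinguish.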

These feature selection methods have been widely used and have shown promising results [23, 28]. We propose combining the lists of features produced by these methods using the following three operators:

• Union (UN): The UN operator aggregates the feature selection approaches. As a result of the aggregation, the number of selected features will be greater than or equal to the threshold level used. The exact equivalent threshold (UTh) is:

UTh = Th × Sim + 2 × Th × (1 − Sim),   (5)

where Th is the threshold used and Sim is the similarity between the combined lists.

• Union-cut (UC): The UC operator produces a list which is mainly a subset of the union list, but whose size is limited to the threshold level used. This is done by removing the lowest-DF features among those produced by only one of the feature selection pair.

• Intersection (INT): The INT operator selects the features common to the feature selection approaches used; the resulting list is the intersection of the features selected by each scoring pair. Evidently, this list will contain no more features than the threshold level used. The equivalent threshold (ITh) is:

ITh = Th × Sim.   (6)
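Assuming each method contributes its top-Th ranked features, the three operators can be sketched as follows (the argument names are illustrative):

```python
def combine(rank1, rank2, doc_freq, th):
    """rank1, rank2: feature lists sorted by decreasing score under the
    two methods; doc_freq: feature -> document frequency; th: number of
    features each method keeps. Returns (UN, UC, INT) feature sets."""
    s1, s2 = set(rank1[:th]), set(rank2[:th])
    un = s1 | s2       # UN: between th and 2*th features
    inter = s1 & s2    # INT: at most th features
    # UC: drop the lowest-DF features appearing in only one of the two
    # lists until exactly th features remain.
    uc = set(un)
    for f in sorted(s1 ^ s2, key=lambda f: doc_freq[f]):
        if len(uc) <= th:
            break
        uc.discard(f)
    return un, uc, inter
```

With Sim = |INT| / Th, the union size is |UN| = Th × Sim + 2 × Th × (1 − Sim), which is exactly the relationship stated in Eqs. (5) and (6).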

3 The Experiments

In order to investigate the proposed operators, the 20 Newsgroups, Ohsumed, and Reuters-21578 data-sets were used:

• The 20 Newsgroups Collection (20NG) is a collection of nearly 20,000 articles posted to Usenet newsgroups [11] (available, with the bydate split, at http://people.csail.mit.edu/people/jrennie/20Newsgroups). Some of the newsgroups are closely related (e.g., rec.autos and rec.motorcycles), while others are highly unrelated (e.g., comp.graphics and talk.politics.mideast). The standard "bydate" split was used, in which duplicates and headers are removed. This results in 18,941 documents, 60% of which are reserved for the training set, while the test set contains the remaining 40% [30].

• The Ohsumed data-set (available at http://trec.nist.gov/data/t9_filtering/) is a collection of 348,566 references gathered from 270 medical journals published from 1987 to 1991 [10]. In this work, the subset used by [12] and [4] was followed: only the first 20,000 documents with abstracts published in 1991 were considered, resulting in 23 categories. The first 10,000 documents were used as the training set and the rest as the test set.

• The Reuters-21578 data-set (available at http://www.daviddlewis.com/resources/testcollections/reuters21578/) has been a standard benchmark in TC for the last 10 years [6]. It consists of over 20,000 news stories that appeared on the Reuters newswire from 1987 [9]. In this experiment, we used the ModApté split (Reuters(90)), which contains all categories with at least one positive training example and one positive test example, resulting in 90 categories [12].

Each of the data-sets used has its own unique characteristics. The documents in the 20NG, unlike those of Ohsumed and Reuters(90), are large and contain diverse vocabulary. This makes the 20NG a more challenging classification problem that is expected to be computationally demanding. Another distinction is that the 20NG data-set has a nearly even distribution, i.e., each category has nearly the same number of documents. This is in contrast to Ohsumed and Reuters(90): while Ohsumed can be considered a moderately diverse data-set, Reuters(90) is a highly skewed one. This skew is evident when considering that the largest category in Reuters(90), "earn", contains about 2,800 training examples, whereas the smallest category consists of only one training example. Applying the proposed operators on these diverse data-sets helps illustrate their behavior.

The F1 measure was adopted to evaluate the performance of the proposed operators. F1 was first proposed as a measure of effectiveness in TC by Lewis [14], and is defined as:

F1(i) = (2 × TP_i) / (2 × TP_i + FP_i + FN_i),   (7)

where TP_i, FP_i, and FN_i refer to the numbers of true positives w.r.t. c_i (documents correctly classified as belonging to c_i), false positives w.r.t. c_i (documents incorrectly classified as belonging to c_i), and false negatives w.r.t. c_i (documents incorrectly classified as not belonging to c_i), respectively.

To compute the global performance over all categories, micro-averaging (microF1) and macro-averaging (macroF1) are used. In the microF1 test, all the binary decisions are collected in a joint pool and F1 is then computed. The macroF1 test, on the other hand, is based on calculating F1 for individual categories; the measure is then averaged over all categories [27]. Generally, the macroF1 test weights all categories equally, and thus is influenced by the performance on rare categories, whereas the microF1 test is dominated by the performance on frequent categories [6].
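The two averaging schemes can be sketched directly from the per-category counts (the function name is ours):

```python
def micro_macro_f1(per_category):
    """per_category: list of (TP, FP, FN) tuples, one per category.
    Returns (microF1, macroF1) following Eq. (7)."""
    def f1(tp, fp, fn):
        denom = 2 * tp + fp + fn
        return (2 * tp) / denom if denom else 0.0
    # macroF1: average of the per-category F1 scores
    macro = sum(f1(tp, fp, fn) for tp, fp, fn in per_category) / len(per_category)
    # microF1: pool all binary decisions, then compute F1 once
    tp = sum(c[0] for c in per_category)
    fp = sum(c[1] for c in per_category)
    fn = sum(c[2] for c in per_category)
    return f1(tp, fp, fn), macro
```

With one frequent, well-classified category and one rare, poorly classified one, macroF1 drops much more than microF1, which is exactly the sensitivity difference described above.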

4 Results

Pairs of selected feature scoring methods were combined using the three operators: union (UN), union-cut (UC), and intersection (INT). In this study we applied classification using SVMlight [13], which is publicly available at http://svmlight.joachims.org. The experimental results were evaluated on the 20NG, Ohsumed, and Reuters(90) data-sets. It is worth noting that since the 20NG data-set is evenly distributed, its microF1 and macroF1 have matching values; therefore, only the microF1 results are reported for it. To facilitate the assessment of UC, the similarity between each scoring method (M1 and M2) and the UC list was evaluated at each examined threshold (Th). For the analysis of UN and INT, the thresholds UTh and ITh were calculated; these are the equivalent thresholds resulting from the use of the UN and INT operators, respectively. The performance of the UN and INT operators at a certain threshold (Th) was compared with that of M1 and M2 at UTh and ITh, respectively. In the following, we focus on the evaluation of the combining methods for each data-set separately.

4.1 The 20NG data-set

The results showed that both the IG and MI feature selection methods outperform DF and CC.

Table 1: Performance evaluation on the 20NG data-set. For each pair of scoring methods (M1, M2): Th is the selection threshold (%); UTh and ITh are the equivalent thresholds (%) for the UN and INT operators; the Sim columns give list similarities (%); the remaining columns give microF1.

CC and DF (M1 = CC, M2 = DF)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC   M1     M2     UN     UC     INT
0.5   0.8   0.2   49.3   62.2   87.2    0.653  0.671  0.691  0.644  0.613
1     1.4   0.6   55.6   66.8   88.9    0.709  0.717  0.732  0.700  0.689
1.5   2.1   0.9   57.7   67.4   90.3    0.737  0.737  0.750  0.717  0.711
2.5   3.4   1.6   65.4   73.9   92.3    0.757  0.752  0.766  0.748  0.743
5     6.4   3.6   71.9   78.6   93.3    0.774  0.775  0.777  0.771  0.766
7.5   9.5   5.5   73.9   79.8   94.1    0.781  0.783  0.785  0.781  0.780
10    12.6  7.4   74.1   81.3   93.3    0.785  0.786  0.790  0.785  0.781

CC and IG (M1 = CC, M2 = IG)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC   M1     M2     UN     UC     INT
0.5   0.8   0.2   36.8   78.5   58.4    0.653  0.725  0.737  0.682  0.611
1     1.6   0.5   45.0   82.6   62.4    0.709  0.755  0.758  0.722  0.687
1.5   2.3   0.7   46.7   81.6   65.1    0.737  0.770  0.774  0.746  0.725
2.5   3.7   1.3   53.3   85.7   67.6    0.757  0.786  0.782  0.765  0.760
5     7.0   3.0   59.2   87.1   72.1    0.774  0.791  0.788  0.777  0.776
7.5   10.3  4.7   62.8   88.9   73.9    0.781  0.796  0.792  0.783  0.788
10    13.6  6.4   64.4   88.3   76.1    0.785  0.798  0.794  0.789  0.790

CC and MI (M1 = CC, M2 = MI)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC   M1     M2     UN     UC     INT
0.5   0.8   0.2   45.6   78.9   67.6    0.653  0.719  0.726  0.679  0.622
1     1.5   0.5   51.2   78.4   72.8    0.709  0.745  0.754  0.717  0.693
1.5   2.2   0.8   54.7   78.6   76.0    0.737  0.761  0.771  0.743  0.719
2.5   3.5   1.5   61.3   82.8   78.4    0.757  0.775  0.774  0.758  0.753
5     6.6   3.4   67.9   84.9   82.9    0.774  0.785  0.786  0.777  0.772
7.5   9.7   5.3   70.2   86.1   84.1    0.781  0.790  0.788  0.781  0.785
10    12.7  7.3   72.6   87.0   85.7    0.785  0.794  0.794  0.787  0.786

DF and IG (M1 = DF, M2 = IG)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC   M1     M2     UN     UC     INT
0.5   0.7   0.3   51.1   99.1   52.0    0.671  0.725  0.728  0.672  0.649
1     1.5   0.5   53.5   99.6   54.0    0.717  0.755  0.747  0.718  0.719
1.5   2.2   0.8   54.9   99.7   55.2    0.737  0.770  0.765  0.737  0.737
2.5   3.5   1.5   59.3   99.7   59.7    0.752  0.786  0.779  0.753  0.756
5     6.7   3.3   65.8   99.3   66.5    0.775  0.791  0.787  0.775  0.778
7.5   9.8   5.2   69.3   99.5   69.8    0.783  0.796  0.790  0.782  0.785
10    12.7  7.3   72.8   99.5   73.4    0.786  0.798  0.792  0.787  0.791

DF and MI (M1 = DF, M2 = MI)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC   M1     M2     UN     UC     INT
0.5   0.7   0.3   60.5   100    60.6    0.671  0.719  0.720  0.670  0.659
1     1.3   0.7   66.1   100    66.1    0.717  0.745  0.744  0.717  0.718
1.5   2.0   1.0   69.5   99.4   70.0    0.737  0.761  0.758  0.735  0.737
2.5   3.2   1.8   73.0   99.8   73.2    0.752  0.775  0.772  0.753  0.755
5     6.1   3.9   78.9   99.7   79.2    0.775  0.785  0.784  0.775  0.775
7.5   8.8   6.2   82.1   99.7   82.4    0.783  0.790  0.787  0.782  0.785
10    11.5  8.5   84.9   99.7   85.2    0.786  0.794  0.791  0.786  0.789

IG and MI (M1 = IG, M2 = MI)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC   M1     M2     UN     UC     INT
0.5   0.6   0.4   81.2   84.7   98.8    0.725  0.719  0.730  0.719  0.707
1     1.2   0.8   83.7   85.2   98.6    0.755  0.745  0.754  0.744  0.742
1.5   1.8   1.2   81.2   82.8   98.6    0.770  0.761  0.771  0.762  0.762
2.5   2.9   2.1   83.8   85.0   99.1    0.786  0.775  0.785  0.774  0.776
5     5.8   4.2   83.6   85.2   98.4    0.791  0.785  0.790  0.785  0.787
7.5   8.7   6.3   84.1   85.8   98.6    0.796  0.790  0.795  0.790  0.791
10    11.3  8.7   87.3   88.0   99.3    0.798  0.794  0.796  0.795  0.797

The performance diversity among these different selection methods decreased as the threshold increased. However, neither CC nor DF achieved a performance matching that of IG and MI, even at the highest threshold. Table 1 illustrates the performance on the 20NG data-set using the different proposed operators. Analyzing these results leads to the following observations:

4.1.1 Using the UN Operator

• CC with DF: Although the performance of the UN list was better than that of both methods at the same threshold, it was nearly the same as (or sometimes slightly better than) the best of them (namely DF) at the equivalent threshold (UTh). This shows that although CC and DF produced different features, these generalized to the same decision boundary using SVM.

• CC with IG or MI: Comparing the microF1 of the UN list with that of the individual methods at the same threshold showed a slight improvement, notable at low thresholds. However, this improvement did not surpass the performance of IG or MI at UTh, despite exceeding that of CC. This indicates that adding features selected by CC to either IG or MI improved the classification only marginally, while the computational overhead needed to compose the combined feature set rendered it inefficient.

• DF with IG or MI: The results showed that the microF1 of the UN list was lower than that of IG or MI at the equivalent threshold. In fact, with the exception of the 0.5 threshold, the results of IG or MI at the original threshold outperformed their aggregated lists using the UN operator. This means that adding features selected using DF added unreliable features to those selected, and hence reduced the classification accuracy.

• IG with MI: The performance of the UN list almost matched that of IG at the original threshold. This implies that adding features selected by MI to IG did not degrade the performance, yet it also produced no significant improvement.

4.1.2 Using the UC Operator

• CC with DF: The UC list selected features from both methods, with some bias towards DF. However, the performance of UC was lower than that of both DF and CC. This indicates that the DF-based cutting criterion selects features that do not contribute to successful categorization.

• CC with IG or MI: The UC operator in this case was biased towards the features selected using CC. The performance was better than that of CC; however, it was not sufficient to outperform IG or MI.

• DF with IG or MI: The similarity measure indicates that the UC list was nearly identical to that of DF; therefore, the performance of the UC list matched that of DF. Combining this observation with the selection criterion of the UC list, which is based on DF, indicates that the features selected by IG or MI were less frequent in the documents. This supports the assertion that frequently occurring features are not necessarily the best selection for text categorization.

• IG with MI: The similarity between the UC and MI lists was apparent; hence, the performance was almost identical to that of MI. This indicates that the features selected by IG had lower DF than those generated by MI.

4.1.3 Using the INT Operator

• DF or CC with IG or MI: The INT list showed better performance than either DF or CC at the equivalent threshold ITh. However, it did not achieve performance levels matching either IG or MI.

• (IG with MI) and (CC with DF): The classification results within each of these pairs were relatively close. The performance of the INT operator almost matched that of the better selection method. However, the confidence in the INT list is considered higher, since it was generated by two different methods.

4.2 The Ohsumed data-set

Similar to the results on the 20NG, there is performance diversity among the feature scoring methods used: IG and MI outperformed DF and CC. It is worth noting that CC outperformed DF on the Ohsumed data-set at the microF1 level, contrary to the 20NG, where DF slightly outperformed CC. Applying the combining operators on this data-set led to the same conclusions for both microF1 and macroF1. Table 2 illustrates the performance evaluation of the proposed operators on the Ohsumed data-set. An analysis of this performance is provided in the following.

4.2.1 Using the UN Operator

• CC with DF: The UN list had nearly the same microF1 and macroF1 as the better of the two methods, namely CC, at UTh.

• DF or CC with IG or MI: The UN list showed improved performance compared to the weaker-performing lists, namely DF or CC, at UTh.

Table 2: Performance evaluation on the Ohsumed data-set. For each pair of scoring methods (M1, M2): Th is the selection threshold (%); UTh and ITh are the equivalent thresholds (%) for the UN and INT operators; the Sim columns give list similarities (%); the remaining columns give microF1 and macroF1.

CC and DF (M1 = CC, M2 = DF)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC | microF1: M1   M2    UN    UC    INT  | macroF1: M1   M2    UN    UC    INT
0.5   0.8   0.2   36.8   49.5   89.1  |  0.380 0.385 0.418 0.378 0.344      |  0.273 0.272 0.310 0.232 0.225
1     1.6   0.4   43.4   54.0   90.9  |  0.447 0.455 0.500 0.423 0.406      |  0.362 0.359 0.406 0.315 0.316
1.5   2.3   0.7   49.5   57.5   92.3  |  0.490 0.485 0.526 0.473 0.448      |  0.395 0.392 0.429 0.368 0.355
2.5   3.6   1.4   57.2   65.7   91.6  |  0.542 0.527 0.553 0.527 0.504      |  0.444 0.440 0.461 0.422 0.414
5     6.8   3.2   64.6   74.2   91.8  |  0.584 0.573 0.590 0.569 0.570      |  0.499 0.490 0.508 0.480 0.486
7.5   9.7   5.3   71.0   77.5   93.5  |  0.601 0.597 0.600 0.590 0.595      |  0.517 0.514 0.515 0.501 0.514
10    12.5  7.5   74.7   80.1   94.6  |  0.604 0.601 0.603 0.596 0.604      |  0.521 0.518 0.519 0.510 0.519

CC and IG (M1 = CC, M2 = IG)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC | microF1: M1   M2    UN    UC    INT  | macroF1: M1   M2    UN    UC    INT
0.5   0.8   0.2   34.0   72.4   64.3  |  0.380 0.486 0.493 0.436 0.358      |  0.273 0.405 0.412 0.300 0.258
1     1.6   0.4   42.9   80.1   64.8  |  0.447 0.532 0.538 0.491 0.438      |  0.362 0.457 0.460 0.381 0.356
1.5   2.4   0.6   43.2   78.3   65.0  |  0.490 0.567 0.575 0.534 0.481      |  0.395 0.508 0.509 0.429 0.399
2.5   3.9   1.1   44.6   81.8   62.9  |  0.542 0.605 0.606 0.567 0.517      |  0.444 0.546 0.535 0.466 0.432
5     7.5   2.5   50.0   85.6   64.7  |  0.584 0.629 0.620 0.602 0.591      |  0.499 0.561 0.546 0.514 0.513
7.5   10.8  4.2   55.6   87.9   67.8  |  0.601 0.635 0.621 0.608 0.609      |  0.517 0.566 0.541 0.521 0.532
10    14.1  5.9   58.9   90.5   68.5  |  0.604 0.636 0.621 0.609 0.618      |  0.521 0.562 0.538 0.521 0.542

CC and MI (M1 = CC, M2 = MI)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC | microF1: M1   M2    UN    UC    INT  | macroF1: M1   M2    UN    UC    INT
0.5   0.8   0.2   36.8   68.6   68.6  |  0.380 0.475 0.486 0.436 0.353      |  0.273 0.397 0.406 0.301 0.255
1     1.5   0.5   45.4   76.1   69.3  |  0.447 0.518 0.531 0.489 0.441      |  0.362 0.446 0.453 0.381 0.358
1.5   2.3   0.7   47.8   75.7   72.1  |  0.490 0.562 0.572 0.531 0.477      |  0.395 0.498 0.499 0.419 0.392
2.5   3.7   1.3   53.4   79.8   73.5  |  0.542 0.595 0.596 0.562 0.526      |  0.444 0.528 0.522 0.460 0.434
5     7.0   3.0   59.6   83.5   76.5  |  0.584 0.620 0.615 0.601 0.587      |  0.499 0.548 0.540 0.512 0.508
7.5   10.1  4.9   65.1   86.5   78.7  |  0.601 0.621 0.615 0.607 0.603      |  0.517 0.548 0.536 0.520 0.525
10    12.9  7.1   71.2   88.9   82.2  |  0.604 0.623 0.617 0.607 0.609      |  0.521 0.545 0.535 0.521 0.526

DF and IG (M1 = DF, M2 = IG)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC | microF1: M1   M2    UN    UC    INT  | macroF1: M1   M2    UN    UC    INT
0.5   0.8   0.2   41.6   96.9   45.9  |  0.385 0.486 0.484 0.396 0.393      |  0.272 0.405 0.395 0.280 0.279
1     1.5   0.5   48.0   99.5   49.0  |  0.455 0.532 0.528 0.456 0.459      |  0.359 0.457 0.446 0.360 0.370
1.5   2.3   0.7   47.2   98.7   48.5  |  0.485 0.567 0.562 0.490 0.481      |  0.392 0.508 0.494 0.396 0.391
2.5   3.8   1.2   48.4   97.8   50.6  |  0.527 0.605 0.594 0.534 0.529      |  0.440 0.546 0.523 0.443 0.446
5     7.2   2.8   55.2   98.6   56.6  |  0.573 0.629 0.612 0.577 0.587      |  0.490 0.561 0.536 0.494 0.511
7.5   10.5  4.5   59.4   98.5   61.0  |  0.597 0.635 0.614 0.598 0.615      |  0.514 0.566 0.533 0.512 0.540
10    13.8  6.2   61.7   99.0   62.7  |  0.601 0.636 0.614 0.602 0.621      |  0.518 0.562 0.532 0.519 0.547

DF and MI (M1 = DF, M2 = MI)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC | microF1: M1   M2    UN    UC    INT  | macroF1: M1   M2    UN    UC    INT
0.5   0.8   0.2   47.5   96.0   51.5  |  0.385 0.475 0.475 0.384 0.392      |  0.272 0.397 0.386 0.274 0.278
1     1.5   0.5   53.0   99.0   54.0  |  0.455 0.518 0.521 0.456 0.457      |  0.359 0.446 0.442 0.359 0.368
1.5   2.2   0.8   55.5   99.3   56.2  |  0.485 0.562 0.556 0.486 0.480      |  0.392 0.498 0.483 0.393 0.392
2.5   3.5   1.5   61.0   98.6   62.4  |  0.527 0.595 0.585 0.529 0.531      |  0.440 0.528 0.510 0.442 0.450
5     6.5   3.5   69.8   99.3   70.5  |  0.573 0.620 0.608 0.576 0.584      |  0.490 0.548 0.530 0.493 0.507
7.5   9.5   5.5   73.0   98.6   74.4  |  0.597 0.621 0.611 0.598 0.610      |  0.514 0.548 0.530 0.513 0.534
10    12.3  7.7   77.0   99.2   77.8  |  0.601 0.623 0.611 0.602 0.614      |  0.518 0.545 0.529 0.519 0.532

IG and MI (M1 = IG, M2 = MI)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC | microF1: M1   M2    UN    UC    INT  | macroF1: M1   M2    UN    UC    INT
0.5   0.5   0.5   90.8   91.8   99.0  |  0.486 0.475 0.482 0.478 0.480      |  0.405 0.397 0.402 0.399 0.401
1     1.1   0.9   89.8   91.3   98.5  |  0.532 0.518 0.534 0.518 0.518      |  0.457 0.446 0.457 0.440 0.444
1.5   1.7   1.3   86.7   88.3   98.3  |  0.567 0.562 0.577 0.557 0.550      |  0.508 0.498 0.518 0.492 0.488
2.5   2.8   2.2   88.2   88.2   100   |  0.605 0.595 0.608 0.592 0.592      |  0.546 0.528 0.543 0.528 0.532
5     5.8   4.2   83.7   85.1   98.6  |  0.629 0.620 0.623 0.619 0.626      |  0.561 0.548 0.551 0.549 0.558
7.5   8.9   6.1   81.6   85.9   99.1  |  0.635 0.621 0.627 0.621 0.628      |  0.566 0.548 0.552 0.548 0.558
10    11.6  8.4   83.6   84.4   99.2  |  0.636 0.623 0.626 0.622 0.632      |  0.562 0.545 0.546 0.542 0.559

However, there was a performance degradation when comparing the UN list with the IG or MI lists at UTh. This degradation also appeared at the same threshold value Th when combining DF with either IG or MI. This indicates that the features selected by either CC or DF do not contribute to performance enhancement.

• IG with MI: Generally, the performance of the UN list was approximately equivalent to that of IG at UTh.

4.2.2 Using the UC Operator

• CC with DF: The performance of the UC list was worse than that of both the CC and DF lists, in spite of the high correlation between the UC list and the DF list (generally above 90%). This indicates that the high-DF features produced by CC were not helpful in the classification process.

• CC or DF with IG or MI: The performance of the UC operator was similar to that of the UN operator. While the UC list led to an enhancement compared to the weaker method, it did not surpass either IG or MI.

• IG with MI: The similarity between the UC list and the list produced by MI was high (above 98%). Therefore, the performance of the UC list was identical to that of MI, which was outperformed by IG.

4.2.3 Using the INT Operator

• CC with DF: The INT operator resulted in an improvement in both microF1 and macroF1 at ITh. This improvement was attained because there was no diversity in the performance of CC and DF.

• CC or DF with IG or MI: The performance of the INT list was similar to that of the UN and UC lists. Taking the intersection of two feature sets with high performance diversity resulted in a list that surpassed the weaker list at ITh. However, the performance of the intersected list remained worse than that of the stronger list.

• IG with MI: The INT operator showed either equivalent or slightly improved performance, due to the high similarity between IG and MI.

4.3 The Reuters(90) data-set

Table 3 illustrates the performance evaluation on the Reuters(90) data-set. Unlike the 20NG and Ohsumed data-sets, there is limited diversity in the performance of the different scoring methods in most cases. The only exception was the performance of CC at low thresholds, where the performance margin was notable in microF1 and macroF1 at thresholds below 1.5% and 5%, respectively. The analysis of the performance of the combining operators yielded the following observations:

4.3.1 Using the UN Operator

• CC with DF: Since DF outperformed CC at most thresholds, the UN list had nearly the same microF1 as DF at UTh. On the other hand, despite the clear superiority of DF over CC in macroF1, the macroF1 of UN was lower than that of DF at UTh, due to the added bias towards frequent items introduced by CC.

• CC with IG or MI: Unlike DF, combining CC with either IG or MI showed a small improvement in both microF1 and macroF1 in most cases, i.e., the enhancement occurred in classifying both frequent and rare categories. It is worth noting that using the UN operator to combine CC with IG or MI provided better microF1 performance than combining DF with either of them, in contrast with the degradation in macroF1. Combining this observation with the fact that DF outperformed CC in both microF1 and macroF1 indicates that adding the features selected by the latter to either IG or MI improved the classification of frequent categories.

• DF with IG or MI: Comparing the performance of the UN list with the individual methods at UTh showed no improvement in microF1 and a slight enhancement in macroF1. This indicates that the UN operator introduced a bias towards rare categories.

• IG with MI: Since the similarity between the feature sets produced by both methods was high (above 90%), the performance of the UN list was nearly the same as that of the better of them at UTh.

4.3.2 Using the UC Operator

• CC with DF: The performance observed was similar to that of the UN operator on this feature selection pair.

• CC or DF with IG or MI: The UC operator in this case achieved an enhancement in microF1 at nearly all thresholds. However, the enhancement was not matched in macroF1; in fact, the UC produced degraded performance compared to the better selection method. This can be attributed to the bias of the UC towards either CC or DF, which favor frequently occurring features.

Table 3: Performance evaluation on the Reuters(90) data-set. For each pair of scoring methods (M1, M2): Th is the selection threshold (%); UTh and ITh are the equivalent thresholds (%) for the UN and INT operators; the Sim columns give list similarities (%); the remaining columns give microF1 and macroF1.

CC and DF (M1 = CC, M2 = DF)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC | microF1: M1   M2    UN    UC    INT  | macroF1: M1   M2    UN    UC    INT
0.5   0.8   0.2   34.0   71.0   63.0  |  0.710 0.790 0.809 0.783 0.662      |  0.227 0.369 0.388 0.258 0.198
1     1.6   0.4   40.3   67.5   73.6  |  0.779 0.824 0.840 0.829 0.745      |  0.295 0.407 0.427 0.368 0.251
1.5   2.3   0.7   44.5   71.7   76.3  |  0.833 0.843 0.857 0.849 0.811      |  0.385 0.435 0.440 0.403 0.348
2.5   3.6   1.4   54.3   73.6   82.0  |  0.852 0.854 0.859 0.858 0.845      |  0.418 0.447 0.445 0.433 0.408
5     6.7   3.3   66.2   80.7   88.1  |  0.867 0.868 0.867 0.869 0.869      |  0.439 0.443 0.436 0.440 0.442
7.5   9.5   5.5   73.4   84.3   89.1  |  0.870 0.870 0.871 0.872 0.869      |  0.434 0.434 0.431 0.443 0.438
10    12.3  7.7   76.9   86.9   90.0  |  0.868 0.871 0.868 0.868 0.871      |  0.428 0.428 0.423 0.428 0.435

CC and IG (M1 = CC, M2 = IG)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC | microF1: M1   M2    UN    UC    INT  | macroF1: M1   M2    UN    UC    INT
0.5   0.8   0.2   35.0   77.0   58.0  |  0.710 0.789 0.818 0.798 0.550      |  0.227 0.394 0.418 0.286 0.206
1     1.6   0.4   35.3   80.1   55.2  |  0.779 0.822 0.844 0.829 0.701      |  0.295 0.442 0.448 0.373 0.263
1.5   2.4   0.6   41.2   82.4   58.8  |  0.833 0.835 0.859 0.853 0.799      |  0.385 0.442 0.458 0.424 0.349
2.5   3.7   1.3   50.5   85.8   66.2  |  0.852 0.851 0.866 0.863 0.833      |  0.418 0.450 0.455 0.432 0.401
5     7.0   3.0   60.1   90.0   72.3  |  0.867 0.866 0.869 0.869 0.864      |  0.439 0.446 0.442 0.442 0.445
7.5   10.0  5.0   67.2   92.2   75.0  |  0.870 0.867 0.871 0.872 0.868      |  0.434 0.438 0.430 0.434 0.440
10    13.1  6.9   68.6   93.8   74.8  |  0.868 0.868 0.868 0.870 0.872      |  0.428 0.435 0.423 0.428 0.442

CC and MI (M1 = CC, M2 = MI)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC | microF1: M1   M2    UN    UC    INT  | macroF1: M1   M2    UN    UC    INT
0.5   0.8   0.2   35.0   77.0   58.0  |  0.710 0.798 0.814 0.795 0.612      |  0.227 0.393 0.417 0.295 0.197
1     1.6   0.4   37.3   80.1   57.2  |  0.779 0.827 0.846 0.829 0.714      |  0.295 0.441 0.446 0.372 0.260
1.5   2.3   0.7   43.5   81.7   61.8  |  0.833 0.835 0.859 0.854 0.798      |  0.385 0.443 0.456 0.423 0.350
2.5   3.7   1.3   53.9   83.0   70.9  |  0.852 0.855 0.865 0.864 0.836      |  0.418 0.451 0.452 0.443 0.402
5     6.9   3.1   62.7   89.8   75.6  |  0.867 0.866 0.868 0.869 0.865      |  0.439 0.449 0.444 0.441 0.453
7.5   9.7   5.3   70.7   91.8   78.9  |  0.870 0.873 0.870 0.871 0.872      |  0.434 0.439 0.430 0.434 0.443
10    12.6  7.4   74.0   93.2   80.8  |  0.868 0.871 0.868 0.869 0.874      |  0.428 0.433 0.424 0.429 0.439

DF and IG (M1 = DF, M2 = IG)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC | microF1: M1   M2    UN    UC    INT  | macroF1: M1   M2    UN    UC    INT
0.5   0.7   0.3   64.0   88.0   76.0  |  0.790 0.789 0.813 0.808 0.742      |  0.369 0.394 0.420 0.400 0.339
1     1.4   0.6   56.9   93.4   63.5  |  0.824 0.822 0.842 0.841 0.793      |  0.407 0.442 0.446 0.421 0.375
1.5   2.1   0.9   59.1   93.9   65.2  |  0.843 0.835 0.850 0.851 0.826      |  0.435 0.442 0.453 0.448 0.430
2.5   3.5   1.5   60.3   94.2   66.4  |  0.854 0.851 0.857 0.859 0.846      |  0.447 0.450 0.445 0.448 0.452
5     6.6   3.4   67.9   96.3   71.6  |  0.868 0.866 0.868 0.870 0.863      |  0.443 0.446 0.437 0.445 0.449
7.5   9.7   5.3   71.3   96.8   74.6  |  0.870 0.867 0.870 0.871 0.867      |  0.434 0.438 0.432 0.441 0.443
10    12.7  7.3   73.0   97.7   76.1  |  0.871 0.868 0.868 0.869 0.869      |  0.428 0.435 0.424 0.423 0.434

DF and MI (M1 = DF, M2 = MI)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC | microF1: M1   M2    UN    UC    INT  | macroF1: M1   M2    UN    UC    INT
0.5   0.7   0.3   65.0   88.0   77.0  |  0.790 0.798 0.810 0.805 0.772      |  0.369 0.393 0.417 0.402 0.343
1     1.4   0.6   58.9   93.9   65.0  |  0.824 0.827 0.836 0.836 0.811      |  0.407 0.441 0.443 0.422 0.394
1.5   2.0   1.0   63.4   94.6   68.8  |  0.843 0.835 0.851 0.850 0.828      |  0.435 0.443 0.451 0.448 0.438
2.5   3.4   1.6   65.9   94.0   71.9  |  0.854 0.855 0.858 0.860 0.848      |  0.447 0.451 0.443 0.447 0.446
5     6.4   3.6   71.9   96.7   75.5  |  0.868 0.866 0.867 0.870 0.865      |  0.443 0.449 0.439 0.444 0.448
7.5   9.3   5.7   76.4   97.2   79.5  |  0.870 0.873 0.868 0.871 0.873      |  0.434 0.439 0.430 0.433 0.444
10    12.0  8.0   80.3   97.7   82.6  |  0.871 0.871 0.870 0.870 0.872      |  0.428 0.433 0.426 0.424 0.434

IG and MI (M1 = IG, M2 = MI)
Th    UTh   ITh   M1-M2  M1-UC  M2-UC | microF1: M1   M2    UN    UC    INT  | macroF1: M1   M2    UN    UC    INT
0.5   0.5   0.5   96.0   98.0   98.0  |  0.789 0.798 0.803 0.803 0.767      |  0.394 0.393 0.410 0.405 0.384
1     1.1   0.9   91.0   94.5   96.5  |  0.822 0.827 0.827 0.827 0.811      |  0.442 0.441 0.441 0.445 0.443
1.5   1.6   1.4   93.4   95.0   98.3  |  0.835 0.835 0.838 0.836 0.831      |  0.442 0.443 0.448 0.447 0.441
2.5   2.6   2.4   95.0   95.5   99.6  |  0.851 0.855 0.856 0.857 0.847      |  0.450 0.451 0.450 0.456 0.449
5     5.3   4.7   93.4   95.2   98.6  |  0.866 0.866 0.866 0.866 0.866      |  0.446 0.449 0.445 0.448 0.450
7.5   8.2   6.8   90.5   93.6   98.2  |  0.867 0.873 0.871 0.871 0.867      |  0.438 0.439 0.438 0.438 0.438
10    10.8  9.2   92.3   92.9   99.4  |  0.868 0.871 0.870 0.871 0.868      |  0.435 0.433 0.430 0.431 0.435

combining them using the UC operator led to a slight improvement in macroF1 at low thresholds. Since the similarity measure indicates that the UC list was somewhat biased towards MI, this improvement suggests that the high-DF features selected by MI were helpful in distinguishing rare categories.

4.3.3 Using the INT Operator

• CC with all other methods: The INT operator improved microF1 at ITh for threshold values greater than 1.5%. This improvement was not matched in macroF1, owing to the large performance gap between CC and the other methods.

• DF with IG or MI: The INT operator improved microF1 at the corresponding threshold, ITh, and this improvement was consistent at most thresholds. However, it was not matched by an improvement in macroF1, which can be attributed to the poor macroF1 performance of DF compared with IG or MI.

• IG with MI: There was no significant reduction in the number of selected features, due to the high similarity between the IG and MI lists. Hence, there was no significant difference between the performance of this operator and that of the IG or MI selection methods alone, although the robustness of the list was enhanced by being confirmed by two different selection methods.
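The effect described for IG with MI can be illustrated with a small overlap computation (the feature lists below are hypothetical; the actual lists are produced by the scoring methods):

```python
# When two selection methods agree on most features, intersecting their
# lists barely reduces the feature count.
ig_list = {"stock", "market", "trade", "oil", "price", "bank"}
mi_list = {"stock", "market", "trade", "oil", "price", "share"}

common = ig_list & mi_list
overlap = len(common) / min(len(ig_list), len(mi_list))

print(len(common))   # 5 of the 6 features survive the intersection
print(overlap)
```

With a high overlap ratio, the INT list is nearly identical to either input list, so neither the storage savings nor the classification performance changes noticeably.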

5 Conclusion and Discussion

In this paper we proposed three operators for combining pairs of local feature scoring methods: (a) Union (UN), which takes the union of the two lists; (b) Union-cut (UC), which limits the size of the union by eliminating the low-DF features; and (c) Intersection (INT), which selects the features common to the two lists. Experimental comparisons were conducted on three distinct benchmark data-sets: 20NG, Ohsumed, and Reuters(90). Table 4 summarizes the results obtained when applying the combining operators to these data-sets. The suggested operators showed some potential in enhancing performance on the Reuters(90) data-set. Combining DF with either MI or IG using the UC and INT operators yielded an improvement in microF1. However, the UN operator gave the best improvement in macroF1, as the UC and INT operators enhanced macroF1 in only a limited number of cases. On the other hand, combining CC with either

Table 4: Summary of Results. For each method pair (CC-DF; CC-IG, CC-MI; DF-IG, DF-MI; IG-MI) and each operator (UN, UC, INT), the table summarizes the effect on the microF1 of 20NG, the microF1 and macroF1 of Ohsumed, the microF1 of Reuters(90), and the macroF1 of Reuters(90), where + refers to performance improvement, − refers to performance degradation, ≃ refers to approximately equal performance, and +>1.5% marks an improvement at thresholds above 1.5%.

MI or IG caused an improvement in microF1 using all the suggested operators. As with the combination of DF with either IG or MI, the UN operator yielded the best enhancement in macroF1. The UC operator produced the best improvement in microF1 when combining CC with DF, while none of the operators enhanced the macroF1. Since the correlation between MI and IG was high, none of the applied operators improved their combined performance, with the exception of a slight improvement in macroF1 using the UC operator in some cases.

Comparing the performance of the UN, UC, and INT operators at thresholds UTh, Th, and ITh, respectively, showed that the UC operator provided the best microF1, whereas the UN operator had the best macroF1. Considering the performance of all individual and combined methods at each threshold, the best microF1 was achieved by applying the UC operator to DF and IG, while the best macroF1 was attained by applying the UN operator to CC and IG, with the exception of a few thresholds where the UN list of DF and IG was better.

On the other hand, applying the combining operators to the 20NG and Ohsumed data-sets showed no significant improvement, with the exception of the INT operator, which showed potential when used to combine DF with CC. None of the other operators led to an improvement when applied to the other scoring methods. This is due to the large diversity in performance among the scoring methods; the results suggest that applying the combining operators to data-sets of this nature leads to no improvement in performance.

As a general conclusion, the proposed combining operators were suitable when the pairs of local feature scoring methods had comparable performance levels. The UC and UN operators provided improved performance on microF1 and macroF1, respectively. The INT operator provided more confidence in the quality of the selected features while potentially reducing their number, leading to improved performance with respect to computational and storage requirements.
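The three operators can be sketched as follows. This is a minimal illustration, not the authors' implementation; the function name, the use of a document-frequency dictionary, and the default UC target size (the size of the larger input list) are assumptions.

```python
def combine_features(list_a, list_b, df, operator, size=None):
    """Combine two feature lists selected by different local scoring methods.

    list_a, list_b : features chosen by two scoring methods (e.g. DF and IG)
    df             : document frequency of each feature (dict: feature -> count)
    operator       : 'UN' (union), 'UC' (union-cut), or 'INT' (intersection)
    size           : target list size for UC (assumed here to default to the
                     size of the larger input list)
    """
    a, b = set(list_a), set(list_b)
    if operator == 'UN':
        return a | b
    if operator == 'INT':
        return a & b
    if operator == 'UC':
        union = a | b
        if size is None:
            size = max(len(a), len(b))
        # Keep the `size` features with the highest document frequency,
        # i.e. eliminate the low-DF features from the union.
        return set(sorted(union, key=lambda f: df[f], reverse=True)[:size])
    raise ValueError(f"unknown operator: {operator}")
```

For example, with lists ["a", "b", "c"] and ["b", "c", "d"], UN returns all four features, INT returns the two shared ones, and UC drops the lowest-DF feature from the union to keep the list at three features.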

References

[1] Chidanand Apté, Fred Damerau, and Sholom Weiss. Automated learning of decision rules for text categorization. ACM Transactions on Information Systems (TOIS), 12(3):233–251, July 1994.

[2] Ron Bekkerman, Ran El-Yaniv, Naftali Tishby, and Yoad Winter. Distributional word clusters vs. words for text categorization. Journal of Machine Learning Research (JMLR), 3:1183–1208, March 2003.

[3] Y.-C. Chiang and J.-H. Chen. An intelligent news recommender agent for filtering and categorizing large volumes of text corpus. International Journal of Intelligent Systems, 19(3):201–216, 2003.

[4] Elias Combarro, Elena Montanes, Irene Diaz, Jose Ranilla, and Ricardo Mones. Introducing a family of linear measures for feature selection in text categorization. IEEE Transactions on Knowledge and Data Engineering, 17(9):1223–1232, September 2005.

[5] Rozita Dara and Mohamed Kamel. Sharing training patterns among multiple classifiers. In Fabio Roli, Josef Kittler, and Terry Windeatt, editors, Multiple Classifier Systems, Fifth International Workshop, MCS 2004, Cagliari, Italy, June 2004, Proceedings, volume 3077 of Lecture Notes in Computer Science, pages 243–252. Springer-Verlag, Berlin, Germany, 2004.

[6] Franca Debole and Fabrizio Sebastiani. An analysis of the relative hardness of Reuters-21578 subsets. Journal of the American Society for Information Science and Technology (JASIST), 56(6):584–596, April 2005.

[7] Irene Díaz, José Ranilla, Elena Montañés, Javier Fernández, and Elías Combarro. Improving performance of text categorization by combining filtering and support vector machines. Journal of the American Society for Information Science and Technology (JASIST), 55(7):579–592, May 2004.

[8] Susan Dumais, John Platt, David Heckerman, and Mehran Sahami. Inductive learning algorithms and representations for text categorization. In Proc. of the 7th ACM International Conference on Information and Knowledge Management (CIKM'98), pages 148–155, Bethesda, United States, November 02–07, 1998. ACM Press, New York, United States.

[9] Philip J. Hayes and Steven P. Weinstein. CONSTRUE/TIS: a system for content-based indexing of a database of news stories. In Proc. of the IAAI-90, 2nd Conference on Innovative Applications of Artificial Intelligence, pages 49–66, Boston, United States, 1990. AAAI Press, Menlo Park, United States.

[10] William Hersh, Chris Buckley, T. J. Leone, and David Hickam. Ohsumed: an interactive retrieval evaluation and new large test collection for research. In Proc. of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR'94), pages 192–201, Dublin, Ireland, July 03–06, 1994. Springer-Verlag New York, Inc.

[11] Thorsten Joachims. A probabilistic analysis of the Rocchio algorithm with TFIDF for text categorization. In Proc. of the 14th International Conference on Machine Learning (ICML'97), pages 143–151, Nashville, United States, July 08–12, 1997. Morgan Kaufmann Publishers, San Francisco, United States.

[12] Thorsten Joachims. Text categorization with support vector machines: learning with many relevant features. In Proc. of the 10th European Conference on Machine Learning (ECML'98), volume 1398 of Lecture Notes in Computer Science, pages 137–142, Chemnitz, Germany, April 21–24, 1998. Springer-Verlag New York, Inc.

[13] Thorsten Joachims. Making large-scale support vector machine learning practical. In Advances in Kernel Methods – Support Vector Learning, pages 169–184. MIT Press, Cambridge, MA, United States, 1999.

[14] David Lewis. Evaluating and optimizing autonomous text classification systems. In Proc. of the 18th ACM International Conference on Research and Development in Information Retrieval (SIGIR'95), pages 246–254, Seattle, Washington, United States, July 9–13, 1995. ACM Press, New York, United States.

[15] David Lewis, Robert Schapire, James Callan, and Ron Papka. Training algorithms for linear text classifiers. In Proc. of the 19th ACM International Conference on Research and Development in Information Retrieval (SIGIR'96), pages 298–306, Zürich, Switzerland, August 18–22, 1996. ACM Press, New York, United States.

[16] Hwee Ng, Wei Goh, and Kok Low. Feature selection, perceptron learning, and a usability case study for text categorization. In Proc. of the 20th ACM International Conference on Research and Development in Information Retrieval (SIGIR'97), pages 67–73, Philadelphia, United States, July 27–31, 1997. ACM Press, New York, United States.

[17] Kamal Nigam, John Lafferty, and Andrew McCallum. Using maximum entropy for text classification. In Proc. of the IJCAI-99 Workshop on Machine Learning for Information Filtering, pages 61–67, Stockholm, Sweden, August 1st, 1999. Morgan Kaufmann Publishers, San Francisco, United States.

[18] Giorgio Nunzio. A bidimensional view of documents for text categorisation. In Proc. of the 26th European Conference on IR Research (ECIR'04), volume 2997 of Lecture Notes in Computer Science, pages 112–126, Sunderland, United Kingdom, April 5–7, 2004. Springer-Verlag New York, Inc.

[19] Georgios Paliouras, Vangelis Karkaletsis, Ion Androutsopoulos, and Constantine D. Spyropoulos. Learning rules for large-vocabulary word sense disambiguation: A comparison of various classifiers. In Proc. of the 2nd International Conference on Natural Language Processing, volume 1835 of Lecture Notes in Computer Science, pages 383–394, Patra, Greece, 2000. Springer-Verlag New York, Inc.

[20] Gautam Pant and Padmini Srinivasan. Learning to crawl: Comparing classification schemes. ACM Transactions on Information Systems (TOIS), 23(4):430–462, 2005.

[21] Monica Rogati and Yiming Yang. High-performing feature selection for text classification. In Proc. of the 11th ACM International Conference on Information and Knowledge Management (CIKM'02), pages 659–661, McLean, Virginia, United States, November 04–09, 2002. ACM Press, New York, United States.

[22] Gerard Salton and Christopher Buckley. Term-weighting approaches in automatic text retrieval. Information Processing and Management, 24(5):513–523, 1988.

[23] Fabrizio Sebastiani. Machine learning in automated text categorization. ACM Computing Surveys (CSUR), 34(1):1–47, March 2002.

[24] Yunqing Xia, Angelo Dalli, Yorick Wilks, and Louise Guthrie. FASiL adaptive email categorization system. In Proc. of the 6th International Conference on Computational Linguistics and Intelligent Text Processing (CICLing 2005), volume 3406 of Lecture Notes in Computer Science, pages 723–734, Mexico City, Mexico, February 13–19, 2005. Springer-Verlag New York, Inc.

[25] Yiming Yang. Expert network: effective and efficient learning from human decisions in text categorisation and retrieval. In Proc. of the 17th ACM International Conference on Research and Development in Information Retrieval (SIGIR'94), pages 13–22, Dublin, Ireland, July 03–06, 1994. Springer-Verlag New York, Inc.

[26] Yiming Yang, Tom Ault, Thomas Pierce, and Charles W. Lattimer. Improving text categorization methods for event tracking. In Proc. of the 23rd ACM International Conference on Research and Development in Information Retrieval (SIGIR'00), pages 65–72, Athens, Greece, July 14–18, 2000. ACM Press, New York, United States.

[27] Yiming Yang and Xin Liu. A re-examination of text categorization methods. In Proc. of the 22nd ACM International Conference on Research and Development in Information Retrieval (SIGIR'99), pages 42–49, Berkeley, California, United States, August 15–19, 1999. ACM Press, New York, United States.

[28] Yiming Yang and Jan Pedersen. A comparative study on feature selection in text categorization. In Proc. of the 14th International Conference on Machine Learning (ICML'97), pages 412–420, Nashville, Tennessee, United States, July 08–12, 1997. Morgan Kaufmann Publishers, San Francisco, United States.

[29] Yiming Yang, Seán Slattery, and Rayid Ghani. A study of approaches to hypertext categorization. Journal of Intelligent Information Systems, 18(2/3):219–241, 2002. Special Issue on Automated Text Categorization.

[30] Dell Zhang, Xi Chen, and Wee Lee. Text classification with kernels on the multinomial manifold. In Proc. of the 28th ACM International Conference on Research and Development in Information Retrieval (SIGIR'05), pages 266–273, Salvador, Brazil, August 15–19, 2005. ACM Press, New York, United States.

[31] Le Zhang, Jingbo Zhu, and Tianshun Yao. An evaluation of statistical spam filtering techniques. ACM Transactions on Asian Language Information Processing (TALIP), 3(4):243–269, 2004.
