A New Feature Selection Score for Multinomial Naive Bayes Text Classification Based on KL-Divergence

Karl-Michael Schneider (University of Passau)

[email protected]

42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), July 21–26, 2004, Barcelona, Spain

Introduction

Text Classification
• Assignment of text documents to predefined classes
• Popular machine learning technique
• Applications in news categorization, e-mail filtering, user modeling, document storage, etc.

Naive Bayes
• Simple probabilistic model of text generation
• Performs well despite unrealistic independence assumptions [3]

Feature Selection
• Problem with text: high dimensionality (20,000–100,000 words)
• Solution: use only a subset of the vocabulary
  – Reduce computational overhead
  – Avoid overfitting to training data
• Filtering approach (a minimal sketch follows this section):
  – Score words independently
  – Greedy selection of highest scored words
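
The filtering approach amounts to a two-step recipe: compute a score for every word, then keep the top-scoring words. Below is a minimal Python sketch under that reading; select_features and toy_scores are illustrative names, not from the poster.

```python
# Filtering approach: score words independently, then greedily keep the
# k highest-scored words.
def select_features(scores, k):
    """scores: {word: score}. Returns the k highest-scored words."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy example with made-up scores:
toy_scores = {"ball": 0.90, "the": 0.01, "goal": 0.70, "and": 0.02}
print(select_features(toy_scores, k=2))  # ['ball', 'goal']
```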

Naive Bayes: Probabilistic Framework

• Probabilistic model of text generation [5]:
  – Vocabulary modeled by a single multinomial random variable W
  – Document representation: word count vector d = \langle x_{i1}, \ldots, x_{i|V|} \rangle
  – Distribution of documents:
    p(d|c_j) = p(|d|) \, |d|! \prod_{t=1}^{|V|} \frac{p(w_t|c_j)^{n(w_t,d)}}{n(w_t,d)!}
• Mixture model of document generation:
  p(d) = \sum_{j=1}^{|C|} p(c_j) \, p(d|c_j)
• Bayes' rule:
  p(c_j|d) = \frac{p(c_j) \, p(d|c_j)}{p(d)}
• Classification:
  c^* = \operatorname{argmax}_j p(c_j|d)

Parameter Estimation
• Maximum likelihood estimates with Laplacean priors (a toy implementation follows this section):
  p(c_j) = \frac{|c_j|}{\sum_{j'} |c_{j'}|}
  p(w_t|c_j) = \frac{1 + \sum_{d_i \in c_j} n(w_t, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{d_i \in c_j} n(w_s, d_i)}
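
As an illustration of the framework above, here is a toy Python sketch of multinomial Naive Bayes with the Laplacean estimates just given. It works in log space and drops the length and factorial terms of p(d|c_j), which do not depend on the class; all names are illustrative rather than taken from the paper.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label) pairs.
    Returns log p(c_j), Laplace-smoothed log p(w_t|c_j), and the vocabulary."""
    class_docs = defaultdict(int)       # |c_j|
    word_counts = defaultdict(Counter)  # sum over d_i in c_j of n(w_t, d_i)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(class_docs[c] / n_docs) for c in class_docs}
    log_cond = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        log_cond[c] = {w: math.log((1 + word_counts[c][w]) / (len(vocab) + total))
                       for w in vocab}
    return log_prior, log_cond, vocab

def classify(tokens, log_prior, log_cond, vocab):
    """c* = argmax_j [ log p(c_j) + sum_t n(w_t, d) log p(w_t|c_j) ]."""
    counts = Counter(w for w in tokens if w in vocab)
    return max(log_prior,
               key=lambda c: log_prior[c]
               + sum(n * log_cond[c][w] for w, n in counts.items()))

# Tiny usage example with two made-up training documents:
docs = [("the ball flew into the goal".split(), "sports"),
        ("shares and bonds fell".split(), "finance")]
model = train(docs)
print(classify("ball in the goal".split(), *model))  # 'sports'
```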

Naive Bayes: Information Theoretic Framework

• Kullback-Leibler divergence [4, 1]:
  KL(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
• Measures the divergence of one probability distribution from another
• Distribution of words in a document: p(w_t|d) = n(w_t, d) / |d|
• Classification [2] (a numerical cross-check follows this section):
  c^*(d) = \operatorname{argmax}_j \Big[ \log p(c_j) + \sum_{t=1}^{|V|} n(w_t, d) \log p(w_t|c_j) \Big]
         = \operatorname{argmax}_j \Big[ \frac{1}{|d|} \log p(c_j) + \sum_{t=1}^{|V|} p(w_t|d) \log p(w_t|c_j) \Big]
         = \operatorname{argmin}_j \Big[ KL(p(W|d), p(W|c_j)) - \frac{1}{|d|} \log p(c_j) \Big]
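
The last step of the derivation can be checked numerically: classification by maximum log-likelihood and classification by minimum prior-penalized KL-divergence pick the same class, because the two objectives differ only by the class-independent entropy of p(W|d) and a positive factor 1/|d|. A small sketch, reusing the hypothetical train/classify helpers above:

```python
import math
from collections import Counter

def kl(p, q):
    """KL(p, q) = sum_x p(x) log(p(x)/q(x)), summed over the support of p."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def classify_kl(tokens, log_prior, log_cond, vocab):
    """c* = argmin_j [ KL(p(W|d), p(W|c_j)) - (1/|d|) log p(c_j) ]."""
    counts = Counter(w for w in tokens if w in vocab)
    d_len = sum(counts.values())
    p_d = {w: n / d_len for w, n in counts.items()}       # p(w_t|d)
    def objective(c):
        q = {w: math.exp(log_cond[c][w]) for w in p_d}    # p(w_t|c_j)
        return kl(p_d, q) - log_prior[c] / d_len
    return min(log_prior, key=objective)

# For any document, this agrees with the log-likelihood classifier above:
# classify_kl(doc, *model) == classify(doc, *model)
```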

Feature Selection with Mutual Information

Mutual Information Score
• Mutual information is well known in information theory [1]
• Mutual information between two random variables [1]:
  MI(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)}
• Application to feature selection: measure the mutual information between a word and the class variable
• Requires individual random variables for all words [5, 6]: p(W_t = 1) = p(W = w_t)
• W_t models the occurrence of w_t at any given position
• Mutual information feature score (sketched below): MI(w_t) = MI(W_t; C)
• Random variables W_t are not independent
• Common feature selection score
• Performs well in text classification [7]
• But requires artificial definition of random variables with wrong assumptions
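
A sketch of the MI score for a single word, treating W_t as a binary variable with p(W_t = 1 | c_j) = p(w_t|c_j). How p(c_j) and p(w_t|c_j) are estimated (document counts vs. token counts) is left open here and is an assumption of the sketch, not something the poster fixes.

```python
import math

def mi_score(p_c, p_w_given_c):
    """MI(w_t) = MI(W_t; C) for a binary W_t with p(W_t=1 | c_j) = p(w_t|c_j).

    p_c:         {class: p(c_j)}
    p_w_given_c: {class: p(w_t|c_j)}
    """
    p_w = sum(p_c[c] * p_w_given_c[c] for c in p_c)        # p(W_t = 1)
    mi = 0.0
    for c in p_c:
        # Two outcomes of W_t: the word occupies a position (1) or not (0).
        for p_x_given_c, p_x in ((p_w_given_c[c], p_w),
                                 (1 - p_w_given_c[c], 1 - p_w)):
            p_joint = p_c[c] * p_x_given_c                 # p(W_t = x, C = c_j)
            if p_joint > 0:
                mi += p_joint * math.log(p_joint / (p_x * p_c[c]))
    return mi

# A word concentrated in one class scores higher than a class-independent word:
print(mi_score({"a": 0.5, "b": 0.5}, {"a": 0.10, "b": 0.001}) >
      mi_score({"a": 0.5, "b": 0.5}, {"a": 0.05, "b": 0.05}))  # True
```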

Feature Selection based on KL-Divergence

Motivation
• Naive Bayes selects the class c^* with minimal KL-divergence between d and c^*
• Goal of KL-divergence feature selection:
  – Maximize similarity of documents within training classes
  – Minimize similarity between different training classes

KL-Divergence Score
• Average KL-divergence of the distribution of words in a training document from its class:
  KL(S) = \frac{1}{|S|} \sum_{d_i \in S} \sum_{t=1}^{|V|} p(w_t|d_i) \log \frac{p(w_t|d_i)}{p(w_t|c(d_i))}
• Computation of KL(S) has time complexity O(|S|)
• Approximation of KL(S) based on two assumptions:
  – the number of occurrences of w_t is the same in all documents that contain w_t
  – all documents in the same class have the same length
• Average probability of w_t in the documents that contain w_t, where N_{jt} is the number of documents in c_j that contain w_t:
  \tilde{p}_d(w_t|c_j) = p(w_t|c_j) \frac{|c_j|}{N_{jt}}
• Approximate KL-divergence of a training document from its class:
  \widetilde{KL}(S) = \frac{1}{|S|} \sum_{t=1}^{|V|} \sum_{j=1}^{|C|} N_{jt} \, \tilde{p}_d(w_t|c_j) \log \frac{\tilde{p}_d(w_t|c_j)}{p(w_t|c_j)} = -\sum_{t=1}^{|V|} \sum_{j=1}^{|C|} p(c_j) \, p(w_t|c_j) \log \frac{N_{jt}}{|c_j|}
• Approximate KL-divergence of a training document from the training corpus, where N_t is the number of training documents that contain w_t:
  \widetilde{K}(S) = -\sum_{t=1}^{|V|} p(w_t) \log \frac{N_t}{|S|}
• KL-divergence score for w_t (a computation sketch follows this section):
  KL(w_t) = \widetilde{K}_t(S) - \widetilde{KL}_t(S) = \sum_{j=1}^{|C|} p(c_j) \Big[ p(w_t|c_j) \log \frac{N_{jt}}{|c_j|} - p(w_t) \log \frac{N_t}{|S|} \Big]
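
The approximate score needs only per-class document frequencies and word probabilities, which is what makes it cheap to compute. The sketch below uses illustrative argument names; it estimates p(w_t) from the class mixture and skips classes in which w_t never occurs, both simplifying choices of the sketch rather than details stated on the poster.

```python
import math

def kl_feature_score(p_c, p_w_given_c, df_class, class_size, df_total, n_docs):
    """KL(w_t) = sum_j p(c_j) p(w_t|c_j) log(N_jt/|c_j|) - p(w_t) log(N_t/|S|).

    p_c:         {class: p(c_j)}
    p_w_given_c: {class: p(w_t|c_j)}
    df_class:    {class: N_jt}, documents in c_j that contain w_t
    class_size:  {class: |c_j|}
    df_total:    N_t, training documents that contain w_t
    n_docs:      |S|, number of training documents
    """
    p_w = sum(p_c[c] * p_w_given_c[c] for c in p_c)   # p(w_t) via the class mixture
    score = 0.0
    for c in p_c:
        if df_class[c] > 0:   # classes where w_t never occurs are skipped
            score += p_c[c] * p_w_given_c[c] * math.log(df_class[c] / class_size[c])
    if df_total > 0:
        score -= p_w * math.log(df_total / n_docs)
    return score
```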

Results

Figure (omitted): 20 Newsgroups, classification accuracy vs. vocabulary size for MI, KL, and dKL.
Figure (omitted): Reuters 21578, microaveraged and macroaveraged precision/recall breakeven point vs. vocabulary size for MI, KL, and dKL.

Reuters 21578 microaveraged recall (%), by vocabulary size (relative improvement over MI in parentheses):

        100            5,000          20,000
  MI    86.5           88.0           89.3
  KL    87.2 (+0.8%)   89.2 (+1.5%)   90.1 (+0.9%)
  dKL   88.0 (+1.8%)   89.0 (+1.2%)   89.3 (+0.0%)

Reuters 21578 macroaveraged recall (%), by vocabulary size (relative improvement over MI in parentheses):

        100            5,000          20,000
  MI    76.1           75.1           78.3
  KL    80.1 (+5.3%)   80.7 (+7.4%)   82.2 (+5.0%)
  dKL   78.8 (+3.7%)   78.0 (+3.9%)   78.5 (+0.2%)

Conclusions

• Feature score based on approximation of KL-divergence:
  – comparable to or slightly better than mutual information
  – better performance on the smaller categories of Reuters 21578
• Feature score based on true KL-divergence (future work):
  – considerably higher performance than mutual information on various datasets
  – automatic feature subset selection for maximum performance

References

[1] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley, New York, 1991.
[2] Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar. Enhanced word clustering for hierarchical text classification. In Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 191–200, 2002.
[3] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103–130, 1997.
[4] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
[5] Andrew McCallum and Kamal Nigam. A comparison of event models for Naive Bayes text classification. In Learning for Text Categorization: Papers from the AAAI Workshop, pages 41–48. AAAI Press, 1998. Technical Report WS-98-05.
[6] Jason D. M. Rennie. Improving multi-class text classification with Naive Bayes. Master's thesis, Massachusetts Institute of Technology, 2001.
[7] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In Proc. 14th International Conference on Machine Learning (ICML-97), pages 412–420, 1997.
