A New Feature Selection Score for Multinomial Naive Bayes Text Classification Based on KL-Divergence

Karl-Michael Schneider (University of Passau)

[email protected]

42nd Annual Meeting of the Association for Computational Linguistics (ACL 2004), July 21–26, 2004, Barcelona, Spain

Introduction

Text Classification
• Assignment of text documents to predefined classes
• Popular machine learning technique
• Applications in news categorization, e-mail filtering, user modeling, document storage, etc.

Naive Bayes
• Simple probabilistic model of text generation
• Performs well despite unrealistic independence assumptions [3]

Feature Selection
• Problem with text: high dimensionality (20,000–100,000 words)
• Solution: use only a subset of the vocabulary
  – Reduce computational overhead
  – Avoid overfitting to training data
• Filtering approach (a minimal sketch follows this section):
  – Score words independently
  – Greedy selection of highest scored words
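
The filtering approach amounts to a two-step recipe: compute a score for every word, then keep the top-scoring words. Below is a minimal Python sketch under that reading; select_features and toy_scores are illustrative names, not from the poster.

```python
# Filtering approach: score words independently, then greedily keep the
# k highest-scored words.
def select_features(scores, k):
    """scores: {word: score}. Returns the k highest-scored words."""
    return sorted(scores, key=scores.get, reverse=True)[:k]

# Toy example with made-up scores:
toy_scores = {"ball": 0.90, "the": 0.01, "goal": 0.70, "and": 0.02}
print(select_features(toy_scores, k=2))  # ['ball', 'goal']
```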

Naive Bayes: Probabilistic Framework

• Probabilistic model of text generation [5]:
  – Vocabulary modeled by a single multinomial random variable W
  – Document representation: word count vector d = \langle x_{i1}, \ldots, x_{i|V|} \rangle
  – Distribution of documents:
    p(d|c_j) = p(|d|) \, |d|! \prod_{t=1}^{|V|} \frac{p(w_t|c_j)^{n(w_t,d)}}{n(w_t,d)!}
• Mixture model of document generation:
  p(d) = \sum_{j=1}^{|C|} p(c_j) \, p(d|c_j)
• Bayes' rule:
  p(c_j|d) = \frac{p(c_j) \, p(d|c_j)}{p(d)}
• Classification:
  c^* = \operatorname{argmax}_j p(c_j|d)

Parameter Estimation
• Maximum likelihood estimates with Laplacean priors (a toy implementation follows this section):
  p(c_j) = \frac{|c_j|}{\sum_{j'} |c_{j'}|}
  p(w_t|c_j) = \frac{1 + \sum_{d_i \in c_j} n(w_t, d_i)}{|V| + \sum_{s=1}^{|V|} \sum_{d_i \in c_j} n(w_s, d_i)}
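
As an illustration of the framework above, here is a toy Python sketch of multinomial Naive Bayes with the Laplacean estimates just given. It works in log space and drops the length and factorial terms of p(d|c_j), which do not depend on the class; all names are illustrative rather than taken from the paper.

```python
import math
from collections import Counter, defaultdict

def train(docs):
    """docs: list of (tokens, label) pairs.
    Returns log p(c_j), Laplace-smoothed log p(w_t|c_j), and the vocabulary."""
    class_docs = defaultdict(int)       # |c_j|
    word_counts = defaultdict(Counter)  # sum over d_i in c_j of n(w_t, d_i)
    vocab = set()
    for tokens, label in docs:
        class_docs[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    n_docs = sum(class_docs.values())
    log_prior = {c: math.log(class_docs[c] / n_docs) for c in class_docs}
    log_cond = {}
    for c in class_docs:
        total = sum(word_counts[c].values())
        log_cond[c] = {w: math.log((1 + word_counts[c][w]) / (len(vocab) + total))
                       for w in vocab}
    return log_prior, log_cond, vocab

def classify(tokens, log_prior, log_cond, vocab):
    """c* = argmax_j [ log p(c_j) + sum_t n(w_t, d) log p(w_t|c_j) ]."""
    counts = Counter(w for w in tokens if w in vocab)
    return max(log_prior,
               key=lambda c: log_prior[c]
               + sum(n * log_cond[c][w] for w, n in counts.items()))

# Tiny usage example with two made-up training documents:
docs = [("the ball flew into the goal".split(), "sports"),
        ("shares and bonds fell".split(), "finance")]
model = train(docs)
print(classify("ball in the goal".split(), *model))  # 'sports'
```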

Naive Bayes: Information Theoretic Framework

• Kullback-Leibler divergence [4, 1]:
  KL(p, q) = \sum_x p(x) \log \frac{p(x)}{q(x)}
• Measures the divergence of one probability distribution from another
• Distribution of words in a document: p(w_t|d) = n(w_t, d) / |d|
• Classification [2] (a numerical cross-check follows this section):
  c^*(d) = \operatorname{argmax}_j \Big[ \log p(c_j) + \sum_{t=1}^{|V|} n(w_t, d) \log p(w_t|c_j) \Big]
         = \operatorname{argmax}_j \Big[ \frac{1}{|d|} \log p(c_j) + \sum_{t=1}^{|V|} p(w_t|d) \log p(w_t|c_j) \Big]
         = \operatorname{argmin}_j \Big[ KL(p(W|d), p(W|c_j)) - \frac{1}{|d|} \log p(c_j) \Big]
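
The last step of the derivation can be checked numerically: classification by maximum log-likelihood and classification by minimum prior-penalized KL-divergence pick the same class, because the two objectives differ only by the class-independent entropy of p(W|d) and a positive factor 1/|d|. A small sketch, reusing the hypothetical train/classify helpers above:

```python
import math
from collections import Counter

def kl(p, q):
    """KL(p, q) = sum_x p(x) log(p(x)/q(x)), summed over the support of p."""
    return sum(px * math.log(px / q[x]) for x, px in p.items() if px > 0)

def classify_kl(tokens, log_prior, log_cond, vocab):
    """c* = argmin_j [ KL(p(W|d), p(W|c_j)) - (1/|d|) log p(c_j) ]."""
    counts = Counter(w for w in tokens if w in vocab)
    d_len = sum(counts.values())
    p_d = {w: n / d_len for w, n in counts.items()}       # p(w_t|d)
    def objective(c):
        q = {w: math.exp(log_cond[c][w]) for w in p_d}    # p(w_t|c_j)
        return kl(p_d, q) - log_prior[c] / d_len
    return min(log_prior, key=objective)

# For any document, this agrees with the log-likelihood classifier above:
# classify_kl(doc, *model) == classify(doc, *model)
```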

Feature Selection with Mutual Information

Mutual Information Score
• Mutual information is well known in information theory [1]
• Mutual information between two random variables [1]:
  MI(X; Y) = \sum_x \sum_y p(x, y) \log \frac{p(x, y)}{p(x) \, p(y)}
• Application to feature selection: measure the mutual information between a word and the class variable
• Requires individual random variables for all words [5, 6]: p(W_t = 1) = p(W = w_t)
• W_t models the occurrence of w_t at any given position
• Mutual information feature score (sketched below): MI(w_t) = MI(W_t; C)
• Random variables W_t are not independent
• Common feature selection score
• Performs well in text classification [7]
• But requires artificial definition of random variables with wrong assumptions
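
A sketch of the MI score for a single word, treating W_t as a binary variable with p(W_t = 1 | c_j) = p(w_t|c_j). How p(c_j) and p(w_t|c_j) are estimated (document counts vs. token counts) is left open here and is an assumption of the sketch, not something the poster fixes.

```python
import math

def mi_score(p_c, p_w_given_c):
    """MI(w_t) = MI(W_t; C) for a binary W_t with p(W_t=1 | c_j) = p(w_t|c_j).

    p_c:         {class: p(c_j)}
    p_w_given_c: {class: p(w_t|c_j)}
    """
    p_w = sum(p_c[c] * p_w_given_c[c] for c in p_c)        # p(W_t = 1)
    mi = 0.0
    for c in p_c:
        # Two outcomes of W_t: the word occupies a position (1) or not (0).
        for p_x_given_c, p_x in ((p_w_given_c[c], p_w),
                                 (1 - p_w_given_c[c], 1 - p_w)):
            p_joint = p_c[c] * p_x_given_c                 # p(W_t = x, C = c_j)
            if p_joint > 0:
                mi += p_joint * math.log(p_joint / (p_x * p_c[c]))
    return mi

# A word concentrated in one class scores higher than a class-independent word:
print(mi_score({"a": 0.5, "b": 0.5}, {"a": 0.10, "b": 0.001}) >
      mi_score({"a": 0.5, "b": 0.5}, {"a": 0.05, "b": 0.05}))  # True
```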

Feature Selection based on KL-Divergence

Motivation
• Naive Bayes selects the class c^* with minimal KL-divergence between d and c^*
• Goal of KL-divergence feature selection:
  – Maximize similarity of documents within training classes
  – Minimize similarity between different training classes

KL-Divergence Score
• Average KL-divergence of the distribution of words in a training document from its class:
  KL(S) = \frac{1}{|S|} \sum_{d_i \in S} \sum_{t=1}^{|V|} p(w_t|d_i) \log \frac{p(w_t|d_i)}{p(w_t|c(d_i))}
• Computation of KL(S) has time complexity O(|S|)
• Approximation of KL(S) based on two assumptions:
  – the number of occurrences of w_t is the same in all documents that contain w_t
  – all documents in the same class have the same length
• Average probability of w_t in the documents that contain w_t, where N_{jt} is the number of documents in c_j that contain w_t:
  \tilde{p}_d(w_t|c_j) = p(w_t|c_j) \frac{|c_j|}{N_{jt}}
• Approximate KL-divergence of a training document from its class:
  \widetilde{KL}(S) = \frac{1}{|S|} \sum_{t=1}^{|V|} \sum_{j=1}^{|C|} N_{jt} \, \tilde{p}_d(w_t|c_j) \log \frac{\tilde{p}_d(w_t|c_j)}{p(w_t|c_j)} = -\sum_{t=1}^{|V|} \sum_{j=1}^{|C|} p(c_j) \, p(w_t|c_j) \log \frac{N_{jt}}{|c_j|}
• Approximate KL-divergence of a training document from the training corpus, where N_t is the number of training documents that contain w_t:
  \widetilde{K}(S) = -\sum_{t=1}^{|V|} p(w_t) \log \frac{N_t}{|S|}
• KL-divergence score for w_t (a computation sketch follows this section):
  KL(w_t) = \widetilde{K}_t(S) - \widetilde{KL}_t(S) = \sum_{j=1}^{|C|} p(c_j) \Big[ p(w_t|c_j) \log \frac{N_{jt}}{|c_j|} - p(w_t) \log \frac{N_t}{|S|} \Big]
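
The approximate score needs only per-class document frequencies and word probabilities, which is what makes it cheap to compute. The sketch below uses illustrative argument names; it estimates p(w_t) from the class mixture and skips classes in which w_t never occurs, both simplifying choices of the sketch rather than details stated on the poster.

```python
import math

def kl_feature_score(p_c, p_w_given_c, df_class, class_size, df_total, n_docs):
    """KL(w_t) = sum_j p(c_j) p(w_t|c_j) log(N_jt/|c_j|) - p(w_t) log(N_t/|S|).

    p_c:         {class: p(c_j)}
    p_w_given_c: {class: p(w_t|c_j)}
    df_class:    {class: N_jt}, documents in c_j that contain w_t
    class_size:  {class: |c_j|}
    df_total:    N_t, training documents that contain w_t
    n_docs:      |S|, number of training documents
    """
    p_w = sum(p_c[c] * p_w_given_c[c] for c in p_c)   # p(w_t) via the class mixture
    score = 0.0
    for c in p_c:
        if df_class[c] > 0:   # classes where w_t never occurs are skipped
            score += p_c[c] * p_w_given_c[c] * math.log(df_class[c] / class_size[c])
    if df_total > 0:
        score -= p_w * math.log(df_total / n_docs)
    return score
```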

Results

Figure (omitted): 20 Newsgroups, classification accuracy vs. vocabulary size for MI, KL, and dKL.
Figure (omitted): Reuters 21578, microaveraged and macroaveraged precision/recall breakeven point vs. vocabulary size for MI, KL, and dKL.

Reuters 21578 microaveraged recall (%), by vocabulary size (relative improvement over MI in parentheses):

        100            5,000          20,000
  MI    86.5           88.0           89.3
  KL    87.2 (+0.8%)   89.2 (+1.5%)   90.1 (+0.9%)
  dKL   88.0 (+1.8%)   89.0 (+1.2%)   89.3 (+0.0%)

Reuters 21578 macroaveraged recall (%), by vocabulary size (relative improvement over MI in parentheses):

        100            5,000          20,000
  MI    76.1           75.1           78.3
  KL    80.1 (+5.3%)   80.7 (+7.4%)   82.2 (+5.0%)
  dKL   78.8 (+3.7%)   78.0 (+3.9%)   78.5 (+0.2%)

Conclusions

• Feature score based on approximation of KL-divergence:
  – comparable to or slightly better than mutual information
  – better performance on the smaller categories of Reuters 21578
• Feature score based on true KL-divergence (future work):
  – considerably higher performance than mutual information on various datasets
  – automatic feature subset selection for maximum performance

References

[1] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory. John Wiley, New York, 1991.
[2] Inderjit S. Dhillon, Subramanyam Mallela, and Rahul Kumar. Enhanced word clustering for hierarchical text classification. In Proc. 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 191–200, 2002.
[3] Pedro Domingos and Michael Pazzani. On the optimality of the simple Bayesian classifier under zero-one loss. Machine Learning, 29:103–130, 1997.
[4] S. Kullback and R. A. Leibler. On information and sufficiency. Annals of Mathematical Statistics, 22:79–86, 1951.
[5] Andrew McCallum and Kamal Nigam. A comparison of event models for Naive Bayes text classification. In Learning for Text Categorization: Papers from the AAAI Workshop, pages 41–48. AAAI Press, 1998. Technical Report WS-98-05.
[6] Jason D. M. Rennie. Improving multi-class text classification with Naive Bayes. Master's thesis, Massachusetts Institute of Technology, 2001.
[7] Yiming Yang and Jan O. Pedersen. A comparative study on feature selection in text categorization. In Proc. 14th International Conference on Machine Learning (ICML-97), pages 412–420, 1997.
