Topic-based Vietnamese News Document Filtering in the BioCaster Project

Vu HOANG, University of Natural Sciences, VNU-HCM, Vietnam ([email protected])
Nguyen NGUYEN, University of Information Technology, VNU-HCM, Vietnam ([email protected])
Dien DINH, University of Natural Sciences, VNU-HCM, Vietnam ([email protected])
Nigel COLLIER, National Institute of Informatics, Tokyo, Japan ([email protected])

Abstract
In this paper, we describe a topic-based Vietnamese news document filtering (VTDF) system in the BioCaster Project which automatically classifies news documents from a wide variety of sources into topics relevant to disease outbreak detection. Given the very large number of news reports that have to be analyzed each day, VTDF is a crucial preprocessing step in reducing the burden of semantic annotation. Here we present two different approaches to the Vietnamese document classification problem to be used in the VTDF system. Using the Bag of Words (BOW) and statistical n-gram language modeling (N-Gram) approaches, we evaluated these two widely used classification approaches for our task and showed that N-Gram achieves an average of 95% accuracy with an average filtering time of 79 minutes for about 14,000 documents (about 3 docs/sec).
1. Introduction
1.1. BioCaster Project: Detection and Tracking of Disease Outbreaks from Multilingual News Texts
BioCaster is a collaborative project for the early detection and tracking of newly emerging or re-emerging disease outbreaks in the Asia-Pacific region. Against the background of recent regional outbreaks such as SARS and the spread of zoonotic avian H5N1 influenza, as well as the patchy nature of surveillance system infrastructure, a number of groups in three countries (Japan, Thailand and Vietnam) have begun to develop an Internet-based surveillance system to complement existing efforts by public health agencies. The BioCaster project is being developed as a Web portal using the latest text mining technology that can filter news reports in various regional languages and present a summarized translation in the local language. The text mining system is based on an application ontology with knowledge of domain concepts and relations such as infectious agents, the diseases they cause in humans, symptoms and findings, and drugs, as well as regional locators. In addition to providing an intelligent inference capability, the ontology also provides cross-language terminology correspondence, enabling the translation of event frames between languages. Within BioCaster, topic-based news document filtering has two key roles. The first is to speed up the compilation of domain corpora for training named entity recognizers (NERs). The second and main role, shown in Figure 1, is to act as a preprocessing step for reducing the burden of more intensive semantic annotation at later stages.
Figure 1: Overview of document filtering in the BioCaster system
1.2. Topic-based document classification
Document classification (DC, also called document categorization) is the activity of labeling natural language text with thematic categories from a predefined set. DC is used in many application contexts, such as automatic document indexing, document filtering, and automated metadata generation.
Topic-based document classification, in contrast, aims to find all the news reports or documents from different domains relevant to a particular user-defined topic or topics [8][9]. In this paper, we study the effects of adapting document filtering to the topic of disease outbreaks in Vietnam¹ reported in the local Vietnamese-language news. The task is challenging because the topic crosses traditional domain categories, for example:
- An increase in drug sales or sales of traditional medicines, reported in business news, could indicate a local disease outbreak
- An analysis of a new vaccine for treating an outbreak could be reported under science and technology news
- A new case of an infectious disease could be reported in top news or world news
- An analysis of the spread of a human disease could be reported in health news whereas the spread of an animal disease could be reported in regional news
Topic-based news document classification is highly effective but expensive to perform at the scale we expect to process: over 10,000 documents every day, of which perhaps 600 are relevant. The rest of this paper is organized as follows: in Section 2 we discuss related work, Section 3 presents our model and the processing resources for Vietnamese, Section 4 gives the results of the experiments we conducted, and Section 5 reports our conclusions and future work.
1.3. The differences between Vietnamese and English
1.3.1. Vietnamese characters: Modern Vietnamese is written with the Latin alphabet, known as "Quốc ngữ" (National Script) in Vietnamese. It consists of 29 letters transcribing the sounds of phonemes: 22 Latin letters (a, b, c, d, e, g, h, i, k, l, m, n, o, p, q, r, s, t, u, v, x, y) and 7 modified Latin letters with diacritics (ă, â, đ, ê, ô, ơ, ư).
1.3.2. Vietnamese word boundary: The most obvious difference between English and Vietnamese is
¹ According to WHO data for laboratory-confirmed cases, nearly 24% of reported suspected and probable cases of SARS (Feb 1 - Mar 17, 2003) and 38% of Avian Influenza A/(H5N1) cases (2003-2006) were in Vietnam.
in word boundary identification. In Vietnamese, the boundaries between words are not always marked by spaces as they are in English. Vietnamese writing is monosyllabic in nature: every "syllable" is written as though it were a separate dictation unit, with a space before and after. This unit is called "tiếng" in Vietnamese. Each "tiếng" tends to have its own meaning and thus a strong identity. We consider "tiếng" a Vietnamese morpheme (technically a "morpho-syllable"). One or more (up to four) morpho-syllables combine to form a single word, which can be identified as grammatically or semantically correct from its context. For example, the sentence "Một luật gia cầm cự với tình hình hiện nay" has 10 morpho-syllables (or "tiếng"). A comparison of Vietnamese and English word segmentation is shown in Figure 2:
Figure 2: An ambiguous example in Vietnamese word segmentation
In the dictionary, these ten morpho-syllables are ten words with their own meanings, but in this sentence some of them are only morphemes. There are many different ways to segment the sentence into words, but only two of them are grammatically correct, and one of them (number 1 below) is more reasonable semantically:
1) "A lawyer contends with the present situation" ("Một luật_gia cầm_cự với tình_hình hiện_nay")
2) "A law poultry resists the present situation" ("Một luật gia_cầm cự với tình_hình hiện_nay")
There is thus more than one way of understanding this sentence. If we segment the words as in way 1 (the better one semantically), we may classify the document into the category "Politics-Society"; if we segment them as in way 2, we may classify it into the category "Health" (avian-flu topics). Word segmentation is therefore a necessary step that directly affects topic-based document classification, and it must be solved in preprocessing before further processing can take place.
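To make the ambiguity concrete, the sketch below enumerates every dictionary-valid grouping of the example sentence's morpho-syllables. The mini-dictionary is a hypothetical fragment invented for illustration; a real segmenter such as the maximum entropy approach of [11] also scores the candidate segmentations, which this sketch does not.

```python
# Enumerate all dictionary-valid segmentations of a sequence of
# Vietnamese morpho-syllables. DICTIONARY below is a hypothetical
# fragment for illustration only; compound words use '_' as joiner.
DICTIONARY = {
    # single morpho-syllables
    "một", "luật", "gia", "cầm", "cự", "với",
    "tình", "hình", "hiện", "nay",
    # multi-syllable words
    "luật_gia",    # lawyer
    "gia_cầm",     # poultry
    "cầm_cự",      # to contend / hold out
    "tình_hình",   # situation
    "hiện_nay",    # at present
}

def segmentations(syllables, max_len=4):
    """Return every way to group the syllables into dictionary words,
    where a word is at most max_len morpho-syllables long."""
    if not syllables:
        return [[]]
    results = []
    for i in range(1, min(max_len, len(syllables)) + 1):
        word = "_".join(syllables[:i])
        if word in DICTIONARY:
            for rest in segmentations(syllables[i:], max_len):
                results.append([word] + rest)
    return results

sentence = "một luật gia cầm cự với tình hình hiện nay".split()
segs = segmentations(sentence)
# segs contains, among others, the two grammatical readings above
```

Both readings from the paper appear in the output, along with other dictionary-valid but implausible groupings, which is exactly why a segmenter must rank candidates by context.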
2. Related Work
Numerous efforts have been made to perform automatic document classification on biomedical text collections [12]. Text classification has been performed on various subsections of the biomedical literature, including documents in the Medline database, molecular biology texts, cell biology texts, and clinical narratives [13]. Automatic classification of medical texts has been of great interest since the early 1990s, given the increasing volume of biomedical texts, the need to expedite the extraction of relevant medical facts and evidence, and the need to apply the identified knowledge to particular clinical situations [12]. While numerous text classification projects can be found for the biomedical research literature and for clinical text data, few if any address document classification relating to the topic of disease outbreaks in news reports from different domain categories. The challenge for us is somewhat different from that of more formal etiological analysis of journal articles. For example, terminology may be incorrectly applied, the language used to describe the topics can be exaggerated, facts are often vague, and key words and phrases can be creatively misused. This last case is exemplified by the headlines below:
- Ex 1. Soccer fever has spread to the U.S.
- Ex 2. Succession fever to dog Blair
- Ex 3. Residents hit by shopping bug
In this paper, we have applied the best document classification techniques [1], which have previously been evaluated on English texts. To the best of our knowledge, this is the first time these techniques have been used for Vietnamese. The survey reported in [1] shows that document classification in English has generally achieved satisfactory results, with results on standard corpora such as Reuters, Ohsumed and 20 Newsgroups² ranging from 80 to 93%. However, the reported results for Vietnamese are very restricted and tend to be based on small data sets (from 50 to 100 files per topic) which are not publicly available for independent analysis. Evaluating performance for Vietnamese is therefore very subjective, and it is difficult to identify the best methods. To overcome these problems, we propose the following methodology:
1) Corpus construction: we constructed a Vietnamese corpus which satisfies the conditions of sufficiency, objectivity and balance. A detailed description of the corpus is given in the next section.
2) Filtering model: we consider two main approaches to the document classification problem:
- Bag of Words (BOW) based approach [5]
- Statistical n-gram language modeling (N-Gram) based approach [6]

² http://ai-nlp.info.uniroma2.it/moschitti/corpora.htm
Each approach can be expected to have advantages and disadvantages for different languages. In this paper we therefore concentrate on analyzing the performance, strengths, and speed of each approach in the document filtering problem, especially for the Vietnamese language.
3. Method
3.1. Preparing the Corpus
We built a Vietnamese corpus based on the four largest-circulation Vietnamese online newspapers: VnExpress, TuoiTre Online, Thanh Nien Online and Nguoi Lao Dong Online. The collected texts were automatically preprocessed (removing HTML tags, spelling normalization) with the Teleport software and various heuristics. There followed a stage of manual correction by linguists, who semi-automatically reviewed and corrected documents that had been classified under the wrong topics. Finally, we obtained a relatively large and sufficient corpus of about 100,000 documents:
Level 1: Level 1 includes some top categories from the above popular news websites. These categories relate to the topic of disease outbreaks which we consider relevant to the BioCaster Project. It contains about 33,759 documents for training and 50,373 documents for testing. These documents were classified by journalists and then passed through a careful preprocessing step (see above).
Level 2: Level 2 includes the topics of disease outbreaks. In this experiment, we focus on only two main categories: disease (mainly bird flu, both human and animal cases) and non-disease. The documents of the level-2 corpus are extracted from the level-1 corpus. In the future, we will extend this to other categories of nationally notifiable diseases contained in the ontology, such as SARS, HIV/AIDS, cancer, measles and tuberculosis. Level 2 contains about 14,375 documents for training and 12,076 documents for testing.
3.2. The General Architecture
The general architecture of the filtering system is shown in Figure 3:
Figure 3: The general architecture and the role of document filtering
3.3. Vietnamese Document Classification (DC) Module
The general model of the DC Module is:
Figure 4: The general document classification model
3.3.1. The BOW-based Approach: In this approach the text document is transformed into a feature vector, where a feature is a single token or word.
Preprocessing
- Tokenization: We use the best word segmentation method for Vietnamese [11] as the tokenizer in the BOW approach. All documents are segmented into words or tokens, which are the inputs for the next steps.
- Removing stop words: In this phase the relevant features (tokens) are extracted from the documents. After the set of tokens is extracted, it can be improved by removing features that do not carry any information: function words (e.g., "và", "của" and "nhất là") are removed.
- Weighting schemes: Every input text document is first transformed into a list of words, selecting only those not present in a list of stop words. The words are then matched against the term dictionary, and pointers are obtained to words known to the system; each dictionary entry includes the text of the term, the term frequency, the number of documents containing the term, and the idf (inverse document frequency). To weight the elements we use the standard tf·idf product, with tf the term frequency in the document and idf = log(n/df(i)), where n is the number of documents in the collection and df(i) is the number of documents in the collection containing word i.
- Dimension reduction (feature extraction and selection): Dimension reduction techniques can generally be classified into Feature Extraction (FE) approaches [3] and Feature Selection (FS) approaches [2][4]. To the best of our knowledge, FS algorithms are more popular for real-life text dimension reduction problems, so in this paper we consider only FS algorithms. Much research has been done on feature selection in text classification [1], including MI (Mutual Information), IG (Information Gain), GSS (GSS coefficient), CHI (chi-square), OR (Odds Ratio), the DIA association factor, and RS (Relevancy Score). Recently, the work in [10] showed that OCFS (Optimal Orthogonal Centroid Feature Selection) is consistently better than IG and CHI, with smaller computation time, especially when the reduced dimension is extremely small. We therefore implement six methods that perform best in English text classification: MI, IG, GSS, CHI, OR and, especially, OCFS. From our experiments, we determine which feature selection methods are best for Vietnamese document classification. For the classification model we chose Support Vector Machines (SVM), the best machine-learning algorithm widely applied to text classification [7].
3.3.2. Statistical N-Gram Language Modeling based Approach
Preprocessing
At this stage, we first pass the documents through spelling standardization, e.g., hòa → hoà, thời kỳ → thời kì.
They are then passed through sentence and paragraph segmentation steps (used in the subsequent probability calculation).
N-gram model and n-gram model based classifier
This is a new approach to text classification [6] that has been successfully applied to Chinese and Japanese. In this paper, we apply this model to Vietnamese for the first time and compare it with the traditional BOW approach.
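Returning to the BOW pipeline of Section 3.3.1, its tf·idf weighting scheme can be sketched as follows. The toy documents are invented for illustration, and stop-word removal is assumed to have been applied already.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Weight tokenized documents with tf * idf, where
    idf = log(n / df(i)) as defined in Section 3.3.1."""
    n = len(docs)
    df = Counter()                 # df(i): number of documents containing word i
    for doc in docs:
        df.update(set(doc))
    vectors = []
    for doc in docs:
        tf = Counter(doc)          # tf: term frequency within the document
        vectors.append({w: tf[w] * math.log(n / df[w]) for w in tf})
    return vectors

# Hypothetical toy collection of already-segmented documents:
docs = [["cúm", "gia_cầm", "dịch"],
        ["bóng_đá", "cúm"],
        ["gia_cầm", "vắc-xin"]]
vectors = tfidf_vectors(docs)
```

Note that with this idf a word occurring in every document receives weight zero (log(n/n) = 0), which is the desired behavior: such a word cannot discriminate between topics.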
In this paper, we treat the text of a document as a concatenated sequence of morpho-syllables instead of words, for two main reasons: 1) we want to avoid the Vietnamese word segmentation problem, which has proved very difficult; 2) a morpho-syllable-based n-gram language model is smaller and reduces the sparse-data problem.
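The morpho-syllable n-gram classifier described above can be sketched as follows, for N = 2. This is a minimal illustration of the language-model-per-category approach of [6] with add-one smoothing on invented toy data; it is not the exact model configuration or smoothing used in our system, and the category names and example sentences are hypothetical.

```python
import math
from collections import Counter

class BigramLMClassifier:
    """Train one bigram language model per category over raw
    morpho-syllables (no word segmentation needed) and label a
    document with the category whose model gives it the highest
    log-probability. Add-one smoothing; sketch only."""

    def train(self, labeled_docs):
        # labeled_docs: {category: [document string, ...]}
        self.models, vocab = {}, set()
        for label, docs in labeled_docs.items():
            bigrams, unigrams = Counter(), Counter()
            for doc in docs:
                toks = ["<s>"] + doc.split()   # morpho-syllables
                unigrams.update(toks)
                bigrams.update(zip(toks, toks[1:]))
                vocab.update(toks)
            self.models[label] = (bigrams, unigrams)
        self.v = len(vocab) + 1                # +1 for unseen tokens

    def log_prob(self, label, doc):
        bigrams, unigrams = self.models[label]
        toks = ["<s>"] + doc.split()
        return sum(math.log((bigrams[(a, b)] + 1) / (unigrams[a] + self.v))
                   for a, b in zip(toks, toks[1:]))

    def classify(self, doc):
        return max(self.models, key=lambda label: self.log_prob(label, doc))

clf = BigramLMClassifier()
clf.train({"disease": ["cúm gia cầm lan rộng", "dịch cúm gia cầm"],
           "non-disease": ["bóng đá thế giới", "giá cổ phiếu tăng"]})
label = clf.classify("dịch cúm gia cầm")
```

Because the models operate directly on morpho-syllables, ambiguous groupings such as "luật gia cầm" never have to be resolved explicitly; the per-category bigram statistics absorb the distinction.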
4. Experiments and Results
Recall and precision are used to evaluate the classification models, combined in the F1 measure [1]:

F1 = (2 × recall × precision) / (recall + precision)

Additionally, the total accuracy over the corpus is calculated as the average accuracy of all categories. We define the following abbreviations:
SVM-Multi: SVMs with multi-class classification
SVM-Binary: SVMs with binary classification
kNNs: k-Nearest-Neighbours model [4][13]
N-Gram: statistical n-gram language modeling
To systematically evaluate the Vietnamese document classification models, we compare several feature selection methods (MI, IG, CHI, GSS, OR, OCFS) and different machine learning models (SVMs, kNNs, N-Gram).

Figure 5: Feature Selection Methods Evaluation (2,500 terms) (Corpus Level 2)

In [10] the authors show that OCFS is consistently better than IG and CHI for text categorization, especially when the reduced dimension is extremely small. To verify this on our Vietnamese corpus, we compared OCFS with the traditional feature selection methods CHI, GSS, IG, OR and MI; after each feature selection step, we used SVM to classify the documents. Our results (Figure 5) show that OCFS is the best feature selection algorithm, so we used OCFS for feature selection in Vietnamese topic-based document classification.

Figure 6: Evaluation with different document classification methods (Corpus Level 1)
Figure 7: Evaluation with different document classification methods (Corpus Level 2)
For the SVM training models, we chose the following parameters: C = 1, 10; kernel function = linear; SVM type = C-SVM; other parameters at their defaults. For the kNNs model we chose k = 30 [13]. For the N-Gram model we chose N = 2 and the other default parameters. With BOW-based classification methods such as SVM and kNNs, a corpus with fewer categories yields higher classification results: with the same word segmentation process and the same feature selection algorithm, the input features become more precise when the corpus is compact. This is why the results on corpus level 2 are higher than those on corpus level 1. Conversely, with a statistical method such as N-Gram, a larger corpus gives better statistics, which explains why the N-Gram results on corpus level 1 are higher than those on corpus level 2. From these experiments, we conclude that N-Gram is the best topic-based document classification method for our system when a larger corpus is available.
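The evaluation measures used above, F1 and category-averaged total accuracy, can be sketched as follows; the input values in the demo are hypothetical, not our actual per-category results.

```python
def f1_score(recall, precision):
    """F1 = 2 * recall * precision / (recall + precision);
    defined as 0 when both recall and precision are 0."""
    if recall + precision == 0:
        return 0.0
    return 2 * recall * precision / (recall + precision)

def total_accuracy(per_category_accuracy):
    """Total corpus accuracy as the average accuracy over categories."""
    return sum(per_category_accuracy) / len(per_category_accuracy)

# Hypothetical per-category values for illustration:
f1 = f1_score(0.94, 0.96)
acc = total_accuracy([0.93, 0.97])   # averages to 0.95
```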
Figure 8: Evaluation of learning time (14,375 docs) and testing time (12,075 docs) (Corpus Level 2)
Thanks to its statistical nature, N-Gram is the fastest of the four methods used in our experiments. We can therefore conclude that N-Gram is well suited for use in an online document classification system.
5. Conclusion and Future Work
Vietnamese topic-based text filtering is a key task in reducing the processing burden of high-throughput text mining for automatic infectious disease detection. Our experiments showed that both the SVM models and the N-Gram model achieve significant filtering accuracy (about 95%). Moreover, the N-Gram model seems preferable to SVM for the following reasons: higher filtering speed, avoidance of word segmentation and of an explicit feature selection procedure, and an equivalent F1-score. However, we also recognize that the system retains some classification errors: 1) the limitations of tokenization (the word segmentation tool) affect the quality of classification in the BOW approach; 2) some documents are ambiguous between two or more topics because they contain many tokens or phrases that express the content of each. Our results show that our approach is suitable for Vietnamese topic-based document filtering and is satisfactory in terms of both processing time and accuracy. In the future, we could incorporate more semantic and contextual features (e.g., Latent Semantic Indexing (LSI) [14]) to improve the system's handling of polysemy and synonymy.
6. Acknowledgements
We would like to thank the Global Liaison Office of the National Institute of Informatics in Tokyo for granting us travel funds to research this problem. Finally, we also sincerely thank our colleagues in the VCL Group (Vietnamese Computational Linguistics) for their invaluable and insightful comments.
7. References
[1] Fabrizio Sebastiani. 2002. Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.
[2] Lewis, D.D. 1992. Feature Selection and Feature Extraction for Text Categorization. In Proceedings of the Speech and Natural Language Workshop.
[3] Liu, H. and Motoda, H. 1998. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer Academic, Norwell, MA, USA.
[4] Yang, Y. and Pedersen, J.O. 1997. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML), pp. 412-420.
[5] Ciya Liao, Shamim Alpha, Paul Dixon (Oracle Corporation). 2003. Feature Preparation in Text Categorization. AusDM03 Conference.
[6] Fuchun Peng, Dale Schuurmans, Shaojun Wang. 2004. Augmenting Naive Bayes Classifiers with Statistical Language Models. Information Retrieval, 7, pp. 317-345.
[7] Thorsten Joachims. 1998. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In C. Nedellec and C. Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pp. 137-142.
[8] M. Ikonomakis, S. Kotsiantis, V. Tampakas. 2005. Text Classification Using Machine Learning Techniques. WSEAS Transactions on Computers, Issue 8, Volume 4, August 2005, pp. 966-974.
[9] Casey Whitelaw, Jon Patrick. 2004. Selecting Systemic Features for Text Classification. In Proceedings of the Australian Language Technology Workshop 2004, Australia.
[10] Jun Yan, Ning Liu, Benyu Zhang, Shuicheng Yan, Zheng Chen, Qiansheng Cheng, Weiguo Fan. 2005. OCFS: Optimal Orthogonal Centroid Feature Selection for Text Categorization. ACM 2005.
[11] Dinh Dien, Vu Thuy. 2006. A Maximum Entropy Approach for Vietnamese Word Segmentation. In Proceedings of the 4th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future (RIVF'06), Ho Chi Minh City, Vietnam, Feb 12-16, 2006, pp. 247-252.
[12] De Bruijn, B. and Martin, J. 2002. Getting to the (C)ore of Knowledge: Mining Biomedical Literature. International Journal of Medical Informatics, 67(1-3), pp. 7-18.
[13] Yang, Y.M. and Chute, C.G. 1994. An Example-Based Mapping Method for Text Categorization and Retrieval. ACM Transactions on Information Systems, 12(3), pp. 252-277.
[14] Tao Liu, Zheng Chen, Benyu Zhang, Wei-ying Ma, Gongyi Wu. 2004. Improving Text Classification Using Local Latent Semantic Indexing. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM 2004).