Topic-based Vietnamese News Document Filtering in the BioCaster Project

Vu HOANG (University of Natural Sciences, VNU-HCM, Vietnam), Nguyen NGUYEN (University of Information Technology, VNU-HCM, Vietnam), Dien DINH (University of Natural Sciences, VNU-HCM, Vietnam), Nigel COLLIER (National Institute of Informatics, Tokyo, Japan)

Abstract

In this paper, we describe a topic-based Vietnamese news document filtering (VTDF) system in the BioCaster Project which automatically classifies news documents from a wide variety of sources into topics relevant to disease outbreak detection. Given the very large number of news reports that have to be analyzed each day, VTDF is a crucial preprocessing step in reducing the burden of semantic annotation. Here we present two different approaches to the Vietnamese document classification problem to be used in the VTDF system. Using the Bag of Words (BOW) and Statistical N-Gram Language Modeling (N-Gram) approaches, we evaluated these two widely used classification approaches on our task and show that N-Gram achieves an average of 95% accuracy with an average filtering time of 79 minutes for about 14,000 documents (3 docs/sec).

1. Introduction

1.1. BioCaster Project: Detection and Tracking of Disease Outbreaks from Multilingual News Texts

BioCaster is a collaborative project for the early detection and tracking of newly emerging or re-emerging disease outbreaks in the Asia-Pacific region. Against the background of recent regional outbreaks such as SARS and the spread of zoonotic avian H5N1 influenza, as well as the patchy nature of surveillance system infrastructure, a number of groups in three countries (Japan, Thailand and Vietnam) have begun to develop an Internet-based surveillance system to complement existing efforts by public health agencies. The BioCaster project is being developed as a Web portal using the latest text mining technology that can filter news reports in various regional languages and present a summarized translation in the local language. The text mining system is based on an application ontology with knowledge of domain concepts and relations such as infectious agents, the diseases they cause in humans, symptoms and findings, and drugs, as well as regional locators. In addition to providing an intelligent inference capability, the ontology also provides cross-language terminology correspondence, enabling the translation of event frames between languages.

Within BioCaster, topic-based news document filtering has two key roles. The first is to speed up the compilation of domain corpora for training named entity recognizers (NERs). The second and main role, shown in Figure 1, is to act as a preprocessing step that reduces the burden of more intensive semantic annotation at later stages.

Figure 1: Overview of document filtering in the BioCaster system

1.2. Topic-based document classification

Document classification (DC, also called document categorization) has been described as the activity of labeling natural language text with thematic categories from a predefined set. DC is used in many application contexts, such as automatic document indexing, document filtering, and automated metadata generation.

Topic-based document classification, in contrast, aims to find all the news reports or documents from different domains that are relevant to a particular user-defined topic or topics [8][9]. In this paper, we study the effects of adapting document filtering to the topic of disease outbreaks in Vietnam¹ reported in the local Vietnamese-language news. The task is challenging because the topic crosses traditional domain categories, for example:
- An increase in drug sales or sales of traditional medicines, reported in business news, could indicate a local disease outbreak
- An analysis of a new vaccine for treating an outbreak could be reported under science and technology news
- A new case of an infectious disease could be reported in top news or world news
- An analysis of the spread of a human disease could be reported in health news, whereas the spread of an animal disease could be reported in regional news

Topic-based news document classification is a highly effective task, but it is expensive to perform on the number of documents we expect to process: over 10,000 every day, of which perhaps 600 are relevant. The rest of this paper is organized as follows: Section 2 discusses related work, Section 3 presents our model and the processing resources for Vietnamese, Section 4 gives the results of the experiments we conducted, and Section 5 reports our conclusions and future work.

1.3. The difference between Vietnamese and English

1.3.1. Vietnamese characters: Modern Vietnamese is written with the Latin alphabet, known as "Quốc ngữ" (National Script) in Vietnamese. It consists of 29 letters used to transcribe phonemes: 22 Latin letters (a, b, c, d, e, g, h, i, k, l, m, n, o, p, q, r, s, t, u, v, x, y) and 7 modified Latin letters with diacritics (ă, â, đ, ê, ô, ơ, ư).

1.3.2. Vietnamese word boundary: The most obvious point of difference between English and Vietnamese is in word boundary identification.

¹ According to WHO data for laboratory-confirmed cases, nearly 24% of reported suspected and probable cases of SARS (Feb 1 – Mar 17 2003) and 38% of avian influenza A/(H5N1) cases (2003 – 2006) were in Vietnam.

In Vietnamese, the boundaries between words are not always marked by spaces as they are in English. Vietnamese writing is monosyllabic in nature: every "syllable" is written as though it were a separate dictation unit, with a space before and after. This unit is called "tiếng" in Vietnamese. Each "tiếng" tends to have its own meaning and thus a strong identity. We consider a "tiếng" to be a Vietnamese morpheme (or, technically, a "morpho-syllable"). One or more (up to four) morpho-syllables combine to form a single word, which can be identified as grammatically or semantically correct by its context. For example, the sentence "Một luật gia cầm cự với tình hình hiện nay" has 10 morpho-syllables (or "tiếng"). A comparison of Vietnamese and English word segmentation is shown in Figure 2:

Figure 2: An ambiguous example in Vietnamese word segmentation

In the dictionary, these ten morpho-syllables are 10 words with their own meanings, but in this sentence some of them function only as morphemes. There are many different ways to segment the sentence into words, but only two of them are grammatically correct, and one of them (the first, which is more reasonable semantically) is as follows:
1) "A lawyer contends with the present situation" ("Một luật_gia cầm_cự với tình_hình hiện_nay")
2) "A law poultry resists the present situation" ("Một luật gia_cầm cự với tình_hình hiện_nay")
In this example, there is more than one way of understanding the sentence. If we segment words as in way 1 (the better one semantically), we may classify this document into the category "politics-society"; but if we segment words as in way 2, we may classify this document into the category "Health" (avian flu topics). This shows that word segmentation is a necessary step that affects topic-based document classification, and it needs to be handled in the preprocessing step before further processing can take place.
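To make the combinatorial nature of this ambiguity concrete, the following sketch enumerates the possible segmentations of the example sentence against a toy dictionary. It is purely illustrative; the actual system relies on the maximum-entropy word segmenter of [11], not on dictionary matching.

```python
# Illustrative sketch only: enumerate the possible word segmentations of the
# ambiguous example sentence against a toy dictionary. (The actual system uses
# the maximum-entropy segmenter of [11]; this is not that algorithm.)
TOY_DICT = {
    "một", "luật", "gia", "cầm", "cự", "với", "tình", "hình", "hiện", "nay",
    "luật gia", "gia cầm", "cầm cự", "tình hình", "hiện nay",
}

def segmentations(syllables):
    """Return every way to group consecutive morpho-syllables into dictionary words."""
    if not syllables:
        return [[]]
    results = []
    for i in range(1, min(4, len(syllables)) + 1):   # words are at most 4 syllables long
        candidate = " ".join(syllables[:i])
        if candidate in TOY_DICT:
            for rest in segmentations(syllables[i:]):
                results.append([candidate] + rest)
    return results

sentence = "Một luật gia cầm cự với tình hình hiện nay"
for seg in segmentations(sentence.lower().split()):
    print(" | ".join(seg))
# Among the outputs are "luật gia | cầm cự" (a lawyer contends) and
# "luật | gia cầm | cự" (law / poultry / resists) -- the ambiguity discussed above.
```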

2. Related Work

Numerous efforts have been made to perform automatic document classification on biomedically related text collections [12]. Text classification has been performed on various subsections of the biomedical science literature, including documents found in the Medline database, molecular biology texts, cell biology texts, as well as clinical narratives [13]. Automatic classification of medical texts has been of great interest since the early 1990s, given the increasing volume of biomedical texts, the need to expedite the extraction of relevant medical facts and evidence, as well as the need to apply the identified knowledge to particular clinical situations [12]. While numerous text classification projects can be found with respect to the biomedical research literature and to clinical text data, few if any such projects deal with document classification relating to the topic of disease outbreaks in news reports from different domain categories. The challenge for us is somewhat different from that of more formal etiological analysis of journal articles. For example, terminology may be incorrectly applied, the language used to describe the topics can be exaggerated, facts are often vague, and key words and phrases can be creatively misused. This last case is exemplified by the headlines below:
- Ex 1. Soccer fever has spread to the U.S.
- Ex 2. Succession fever to dog Blair
- Ex 3. Residents hit by shopping bug

In this paper, we have applied the best document classification techniques [1], which have previously been evaluated on English texts. To the best of our knowledge, this is the first time these techniques have been used for Vietnamese. The survey reported in [1] shows that document classification in English has generally achieved satisfactory results, with accuracies on standard corpora such as Reuters, Ohsumed and 20 Newsgroups² ranging from 80% to 93%. However, the reported results for Vietnamese are very limited and tend to be based on small data sets (from 50 to 100 files per topic) which are not publicly available for independent analysis. Evaluating performance for Vietnamese is therefore very subjective, and it is difficult to identify the best methods. To overcome these problems, we propose the following methodology:
1) Corpus construction: we constructed a Vietnamese corpus which satisfies the conditions of sufficiency, objectiveness and balance. A detailed description of the corpus is given in the next section.
2) Filtering model: the document classification problem is usually tackled with the following approaches:
- Bag of Words (BOW) based approach [5]
- Statistical N-Gram Language Modeling based approach [6]

² http://ai-nlp.info.uniroma2.it/moschitti/corpora.htm

Each approach is expected to have advantages and disadvantages for different languages. In this paper we therefore concentrate on analyzing the performance, strengths, and speed of each approach in the document filtering problem, especially for the Vietnamese language.

3. Method

3.1. Preparing the Corpus

We built a Vietnamese corpus based on the four largest-circulation Vietnamese online newspapers: VnExpress, TuoiTre Online, Thanh Nien Online, and Nguoi Lao Dong Online. The collected texts were automatically preprocessed (removing HTML tags, spelling normalization) by the Teleport software and various heuristics. This was followed by a stage of manual correction by linguists, who semi-automatically reviewed and adjusted documents assigned to the wrong topics. Finally, we obtained a relatively large and sufficient corpus of about 100,000 documents:

Level 1: Level 1 includes the top categories from the popular news websites above. These categories relate to the topic of disease outbreaks which we consider relevant to the BioCaster Project. It contains about 33,759 documents for training and 50,373 documents for testing. These documents were classified by journalists and then passed through the careful preprocessing step described above.

Level 2: Level 2 includes the topics of disease outbreaks. In this experiment, we focus on only two main categories: disease (mainly bird flu, both human and animal cases) and non-disease. The documents of the Level 2 corpus are extracted from the Level 1 corpus. In the future, we will extend this to other categories of nationally notifiable diseases contained in the ontology, such as SARS, HIV/AIDS, cancer, measles, tuberculosis, etc. Level 2 contains about 14,375 documents for training and 12,076 documents for testing.
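As a rough illustration of the automatic preprocessing stage (the real pipeline used the Teleport software together with hand-written heuristics, so the code below is only a sketch of the same idea), HTML tags are stripped and the remaining text is normalized:

```python
# Minimal illustrative sketch of the automatic preprocessing step (the real
# pipeline used the Teleport software plus hand-written heuristics).
import re
import unicodedata
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects only the text content of an HTML page, dropping all tags."""
    def __init__(self):
        super().__init__()
        self.chunks = []
    def handle_data(self, data):
        self.chunks.append(data)

def preprocess(raw_html: str) -> str:
    extractor = TextExtractor()
    extractor.feed(raw_html)
    text = " ".join(extractor.chunks)
    # Normalize Unicode so composed/decomposed Vietnamese diacritics compare equal.
    text = unicodedata.normalize("NFC", text)
    # Collapse whitespace runs left over from the HTML layout.
    return re.sub(r"\s+", " ", text).strip()

print(preprocess("<html><body><h1>Dịch cúm gia cầm</h1><p>Tin tức mới</p></body></html>"))
```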

3.2. The General Architecture

The general architecture of the filtering system is shown in Figure 3:

Figure 3: The general architecture and the role of document filtering

3.3. Vietnamese Document Classification (DC) Module

The general model of the DC module is shown in Figure 4:

Figure 4: The general document classification model

3.3.1. The BOW-based Approach: In this approach the text document is transformed into a feature vector, where a feature is a single token or word.

Preprocessing:
- Tokenization: We use the best available Vietnamese word segmentation [11] as the tokenizer in this BOW approach. All documents are segmented into words or tokens that are the input for the next steps.
- Removing stop words: In this phase the relevant features (tokens) are extracted from the documents. After the set of tokens is extracted, it can be improved by removing features that carry no information; function words (e.g., "và", "của" and "nhất là") are removed.
- Weighting schemes: Each input text document is first transformed into a list of words, keeping only those which are not present in the stop-word list. The words are then matched against the term dictionary, and pointers are obtained to words known to the system; each dictionary entry includes the term text, the term frequency, the number of documents containing the term, and the idf (inverse document frequency). To weight the elements we use the standard tf·idf product, with tf the term frequency in the document and idf(i) = log(n / df(i)), where n is the number of documents in the collection and df(i) is the number of documents in the collection containing word i.
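The following sketch illustrates this preprocessing and weighting scheme under simplifying assumptions: whitespace splitting stands in for the real word segmenter of [11], and the stop-word list shown is only a toy example.

```python
# Illustrative sketch of the BOW preprocessing and tf-idf weighting described
# above. Assumptions: whitespace tokenization stands in for the real word
# segmenter [11], and STOP_WORDS is only a toy list.
import math
from collections import Counter

STOP_WORDS = {"và", "của", "là", "với"}   # toy Vietnamese stop-word list

def tokenize(document: str):
    return [tok for tok in document.lower().split() if tok not in STOP_WORDS]

def tf_idf_vectors(documents):
    """Return one {term: weight} vector per document, with weight = tf * log(n / df)."""
    tokenized = [tokenize(doc) for doc in documents]
    n = len(tokenized)
    df = Counter()
    for tokens in tokenized:
        df.update(set(tokens))                        # document frequency of each term
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({t: tf[t] * math.log(n / df[t]) for t in tf})
    return vectors

docs = ["dịch cúm gia_cầm lan rộng", "giá thuốc tăng mạnh", "dịch cúm ở người"]
for vec in tf_idf_vectors(docs):
    print(vec)
```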

- Dimension reduction (feature extraction and selection): Dimension reduction techniques can generally be classified into Feature Extraction (FE) approaches [3] and Feature Selection (FS) approaches [2][4]. To the best of our knowledge, FS algorithms are more popular for dimension reduction on real-life text data, so in this paper we only consider FS algorithms. Much research has been done on feature selection in text classification [1], covering methods such as MI (Mutual Information), IG (Information Gain), GSS (GSS coefficient), CHI (chi-square), OR (Odds Ratio), the DIA association factor, and RS (Relevancy Score). Recently, the work in [10] has shown that OCFS (Optimal Orthogonal Centroid Feature Selection) is consistently better than IG and CHI, with smaller computation time, especially when the reduced dimension is extremely small. We therefore implement the six methods that perform best for English text classification: MI, IG, GSS, CHI, OR and, in particular, OCFS. From our experiments, we determine which feature selection methods are best for Vietnamese document classification. For the classification model we chose Support Vector Machines (SVM), the best-performing machine learning algorithm, which has been widely applied to text classification [7].

3.3.2. Statistical N-Gram Language Modeling based Approach

Preprocessing: At this stage, we first pass the documents through spelling standardization (e.g., hòa → hoà, thời kỳ → thời kì). They are then passed through sentence and paragraph segmentation steps (used in the subsequent probability calculation).

N-gram model and n-gram model based classifier: This is a relatively new approach to text classification [6] that has been successfully applied to Chinese and Japanese. In this paper, we apply this model to Vietnamese for the first time and compare it with the traditional BOW approach.

In this paper, we treat the text of a document as a concatenated sequence of morpho-syllables instead of words, for two main reasons: 1) we want to avoid the Vietnamese word segmentation problem, which has proved to be very difficult; 2) a morpho-syllable-based n-gram language model is smaller and reduces the data sparseness problem.
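A minimal sketch of the kind of classifier this approach implies is shown below: one bigram language model per category, built over morpho-syllables, scoring a new document by log-likelihood. The add-one smoothing and sentence-boundary markers are simplifying assumptions of the sketch, not necessarily the choices made in [6].

```python
# Minimal sketch of an n-gram (here bigram) language-model classifier over
# morpho-syllables, assuming add-one smoothing; the actual smoothing and
# normalization in [6] may differ.
import math
from collections import Counter, defaultdict

class BigramLMClassifier:
    def __init__(self):
        self.bigrams = defaultdict(Counter)   # category -> counts of (w1, w2) pairs
        self.unigrams = defaultdict(Counter)  # category -> counts of w1 contexts
        self.vocab = set()

    def _pairs(self, text):
        syllables = ["<s>"] + text.lower().split() + ["</s>"]
        return list(zip(syllables, syllables[1:]))

    def train(self, labeled_docs):
        for text, category in labeled_docs:
            for w1, w2 in self._pairs(text):
                self.bigrams[category][(w1, w2)] += 1
                self.unigrams[category][w1] += 1
                self.vocab.update((w1, w2))

    def classify(self, text):
        def score(category):
            v = len(self.vocab)
            # Sum of log add-one-smoothed bigram probabilities.
            return sum(
                math.log((self.bigrams[category][(w1, w2)] + 1) /
                         (self.unigrams[category][w1] + v))
                for w1, w2 in self._pairs(text))
        return max(self.bigrams, key=score)

clf = BigramLMClassifier()
clf.train([("dịch cúm gia cầm bùng phát", "disease"),
           ("giá cổ phiếu tăng mạnh", "non-disease")])
print(clf.classify("cúm gia cầm lan rộng"))
```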

4. Experiments and Results

Two measures, recall and precision, are used to evaluate the classification models [1], combined in the F1 measure:

F1 = (2 × recall × precision) / (recall + precision)
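For clarity, the per-category scores are computed from true positive, false positive and false negative counts as in the following sketch (the counts shown are hypothetical, not results from our experiments):

```python
# Illustrative computation of precision, recall and F1 for one category,
# using hypothetical counts (not results from our experiments).
def precision_recall_f1(tp: int, fp: int, fn: int):
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * recall * precision / (recall + precision)) if precision + recall else 0.0
    return precision, recall, f1

# e.g. 580 disease documents correctly retrieved, 45 false alarms, 20 missed
print(precision_recall_f1(tp=580, fp=45, fn=20))
```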

Additionally, the total accuracy over the corpus is calculated as the average accuracy over all categories. We use the following abbreviations:
- SVM-Multi: SVMs with multi-class classification
- SVM-Binary: SVMs with binary classification
- kNNs: the k-Nearest Neighbours model [4][13]
- N-Gram: Statistical N-Gram Language Modeling
To systematically evaluate the Vietnamese document classification models, we compare several feature selection methods (MI, IG, CHI, GSS, OR, OCFS) and different machine learning models (SVMs, kNNs, N-Gram).

Figure 5: Feature Selection Methods Evaluation (2,500 terms) (Corpus Level 2)

In [10] the authors prove that the proposed OCFS is consistently better than IG and CHI for text categorization, especially when the reduced dimension is extremely small. To test this on the Vietnamese corpus, we compared OCFS with the traditional feature selection methods CHI, GSS, IG, OR and MI; after each feature selection step, we used SVM to classify the documents. Our results (Figure 5) show that OCFS is the best feature selection algorithm, so we used OCFS for feature selection in Vietnamese topic-based document classification.
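For reference, the sketch below shows one of the compared criteria, the chi-square (CHI) statistic for a term and a category, computed from a 2×2 contingency table of document counts; the other criteria are derived from similar counts. The example counts are hypothetical.

```python
# Minimal sketch of the chi-square (CHI) feature-scoring criterion for one
# term/category pair, computed from a 2x2 contingency table of document counts.
def chi_square(a: int, b: int, c: int, d: int) -> float:
    """a: docs in category containing term,  b: docs outside category containing term,
       c: docs in category without the term, d: docs outside category without the term."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - c * b) ** 2 / denom if denom else 0.0

# Hypothetical counts for the term "cúm" (flu) and the category "disease":
print(chi_square(a=310, b=12, c=95, d=900))
```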

Figure 6: Evaluation with different document classification methods (Corpus Level 1)

Figure 7: Evaluation with different document classification methods (Corpus Level 2)

For the SVM training models, we chose the following parameters: C = 1 or 10; kernel function = linear; SVM type = C-SVM; other parameters at their default values. For the kNN model, we chose k = 30 [13]. For the N-Gram model, we chose N = 2 and other default parameters.

With the BOW-based classification methods such as SVM and kNNs, the corpus with fewer categories gives higher classification accuracy: with the same word segmentation process and the same feature selection algorithm, the input features become more precise when the corpus is more compact, so the results on the Level 2 corpus are higher than on the Level 1 corpus. Conversely, with a statistical method such as N-Gram, a larger corpus gives better statistics, which explains why the N-Gram result on the Level 1 corpus is higher than on the Level 2 corpus. These experiments show that N-Gram is the best topic-based document classification method for Vietnamese in our system when a larger corpus is available.
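As an illustration of how these settings might be expressed in code (scikit-learn is assumed here purely for illustration; it is not necessarily the toolkit used in our experiments):

```python
# Hypothetical configuration of the compared classifiers with the stated
# parameters; scikit-learn is assumed here, not necessarily the toolkit
# actually used in our experiments.
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier

svm_c1 = SVC(kernel="linear", C=1)            # C-SVM, linear kernel, C = 1
svm_c10 = SVC(kernel="linear", C=10)          # C-SVM, linear kernel, C = 10
knn = KNeighborsClassifier(n_neighbors=30)    # kNN with k = 30 [13]
# The N-Gram classifier (N = 2) follows the bigram language-model sketch in Section 3.3.2.
```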

Figure 8: Evaluation of learning time (14,375 docs) and testing time (12,075 docs) (Corpus Level 2)

Owing to its purely statistical nature, N-Gram is the fastest of the four methods used in our experiments. We therefore conclude that N-Gram is well suited for use in an online document classification system.

5. Conclusion and Future Work

Vietnamese topic-based text filtering is a key task in reducing the processing burden on high-throughput text mining for automatic infectious disease detection. Our experiments showed that the SVM models and the N-Gram model both achieve high filtering accuracy (about 95%). Moreover, the N-Gram model seems preferable to SVM for the following reasons: higher filtering speed, avoidance of word segmentation and of an explicit feature selection procedure, and an equivalent F1-score. However, we also recognize that the system still makes some classification errors: 1) limitations of the tokenizer (the word segmentation tool) affect the quality of classification in the BOW approach; 2) some documents are ambiguous between two or more topics because they contain many tokens or phrases that express the content of both. Our results show that our approach is suitable for Vietnamese topic-based document filtering and is satisfactory in terms of both processing time and accuracy. In future work, we could add more semantic and contextual features (e.g., Latent Semantic Indexing, LSI [14]) to improve the system's handling of polysemy and synonymy.

6. Acknowledgement

We would like to thank the Global Liaison Office of the National Institute of Informatics in Tokyo for granting us the travel funds to research this problem. Finally, we sincerely thank our colleagues in the VCL Group (Vietnamese Computational Linguistics) for their invaluable and insightful comments.

7. References

[1] Fabrizio Sebastiani. Machine Learning in Automated Text Categorization. ACM Computing Surveys, Vol. 34, No. 1, March 2002, pp. 1-47.
[2] D. D. Lewis. Feature Selection and Feature Extraction for Text Categorization. In Proceedings of the Speech and Natural Language Workshop, 1992.
[3] H. Liu and H. Motoda. Feature Extraction, Construction and Selection: A Data Mining Perspective. Kluwer Academic, Norwell, MA, USA, 1998.
[4] Y. Yang and J. O. Pedersen. A Comparative Study on Feature Selection in Text Categorization. In Proceedings of the 14th International Conference on Machine Learning (ICML), 1997, pp. 412-420.
[5] Ciya Liao, Shamim Alpha, Paul Dixon (Oracle Corporation). Feature Preparation in Text Categorization. AusDM03 Conference, 2003.
[6] Fuchun Peng, Dale Schuurmans, Shaojun Wang. Augmenting Naive Bayes Classifiers with Statistical Language Models. Information Retrieval, 7, 2004, pp. 317-345.
[7] Thorsten Joachims. Text Categorization with Support Vector Machines: Learning with Many Relevant Features. In C. Nedellec and C. Rouveirol, editors, Proceedings of ECML-98, 10th European Conference on Machine Learning, number 1398, pp. 137-142.
[8] M. Ikonomakis, S. Kotsiantis, V. Tampakas. Text Classification Using Machine Learning Techniques. WSEAS Transactions on Computers, Issue 8, Volume 4, August 2005, pp. 966-974.
[9] Casey Whitelaw, Jon Patrick. Selecting Systemic Features for Text Classification. In Proceedings of the Australian Language Technology Workshop 2004, Australia.
[10] Jun Yan, Ning Liu, Benyu Zhang, Shuicheng Yan, Zheng Chen, Qiansheng Cheng, Weiguo Fan. OCFS: Optimal Orthogonal Centroid Feature Selection for Text Categorization. ACM, 2005.
[11] Dinh Dien, Vu Thuy. A Maximum Entropy Approach for Vietnamese Word Segmentation. In Proceedings of the 4th IEEE International Conference on Computer Science - Research, Innovation and Vision of the Future (RIVF'06), Ho Chi Minh City, Vietnam, Feb 12-16, 2006, pp. 247-252.
[12] B. de Bruijn and J. Martin. Getting to the (C)ore of Knowledge: Mining Biomedical Literature. International Journal of Medical Informatics, 67 (1-3), 2002, pp. 7-18.
[13] Y. Yang and C. G. Chute. An Example-Based Mapping Method for Text Categorization and Retrieval. ACM Transactions on Information Systems, 12 (3), 1994, pp. 252-277.
[14] Tao Liu, Zheng Chen, Benyu Zhang, Wei-ying Ma, Gongyi Wu. Improving Text Classification Using Local Latent Semantic Indexing. In Proceedings of the Fourth IEEE International Conference on Data Mining (ICDM 2004), 2004.
