A Search-based Chinese Word Segmentation Method

Xin-Jing Wang
IBM China Research Center, Beijing, China
[email protected]

Wen Liu
Huazhong Univ. of Sci. & Tech., Wuhan, China
[email protected]

Yong Qin
IBM China Research Center, Beijing, China
[email protected]

ABSTRACT

In this paper, we propose a novel Chinese word segmentation method that leverages the huge deposit of Web documents and search technology. It simultaneously addresses the ambiguous phrase boundary resolution and unknown word identification problems. Evaluations demonstrate its effectiveness.

Categories and Subject Descriptors

I.2.7 [Artificial Intelligence]: Natural Language Processing – Language parsing and understanding; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing – Linguistic processing.

General Terms: Performance, Algorithms.

Keywords: Chinese word segmentation, search.

1. INTRODUCTION

Automatic Chinese word segmentation is an important technique for many areas including speech synthesis and text categorization [3]. It is challenging because 1) there is no standard definition of a word in Chinese, and 2) word boundaries are not marked by spaces. Two research issues are mainly involved: ambiguous phrase boundary resolution and unknown word identification.

Previous approaches fall roughly into four categories: 1) Dictionary-based methods, which segment sentences by matching entries in a dictionary [3]; their accuracy is determined by the coverage of the dictionary and drops sharply as new words appear. 2) Statistical machine learning methods [1], which are typically based on co-occurrences of character sequences; they generally require large annotated Chinese corpora for model training and lack the flexibility to adapt to different segmentation standards. 3) Transformation-based methods [4], initially used in POS tagging and parsing, which learn a set of n-gram rules from a training corpus and then apply them to new text. 4) Combining methods [3], which combine two or more of the above methods.

As the Web prospers, it brings new opportunities to solve many previously "unsolvable" problems. In this paper, we propose to leverage the Web and search technology to segment Chinese words. The typical advantages of our approach are: 1) it is free from the Out-of-Vocabulary (OOV) problem, a natural benefit of leveraging Web documents; 2) it adapts to different segmentation standards, since ideally we can obtain all valid character sequences by searching the Web; and 3) it can be entirely unsupervised, requiring no training corpora.

2. THE PROPOSED APPROACH

The approach contains three steps: 1) collecting segments, 2) scoring segments, and 3) ranking segmentation schemes.

2.1 Segments Collecting

The segments are collected in two steps. 1) First, the query sentence is segmented by punctuation, which gives several sub-sentences. 2) Then each sub-sentence is submitted to a search engine to collect segments. Technically, if the search engine's inverted indices are inaccessible, as is the case with commercial search engines such as Google and Yahoo!, we collect the highlights (the red words in Figure 1) from the returned snippets as the segments. Otherwise, we check the characters' positions indicated by the inverted indices and find those that neighbor each other in the query. Although search engines generally have local segmentors, we argue that their behavior normally does not affect our results. For example, Figure 1 shows the search results of “ ” (he said happily); our method takes the highlight “ ” (he happily) as one segment, yet by checking the HTML source we found that Yahoo!'s local segmentor outputs “ ”, cutting it into three segments. Consider the extreme case in which the local segmentor segments each sentence into unigrams: the collected segments will still be n-grams, because the unigrams neighbor each other in the retrieved documents, which are written in natural language. This shows that our results are generally independent of search engines' local segmentors.
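To make the collection step concrete, below is a minimal sketch of harvesting highlighted segments from returned snippets. It is illustrative only: the paper does not specify a search API, so the fetch_snippets helper is hypothetical, and the <b>…</b> highlight markup is an assumption about the snippet HTML.

```python
import re

# Assumed highlight markup in snippet HTML; real engines may differ.
HIGHLIGHT = re.compile(r"<b>(.*?)</b>", re.DOTALL)

def collect_segments(sub_sentence, fetch_snippets):
    """Collect candidate segments for one punctuation-delimited sub-sentence.

    fetch_snippets is a hypothetical callable that submits the
    sub-sentence to a search engine and returns snippet HTML strings.
    """
    segments = []
    for snippet in fetch_snippets(sub_sentence):
        for hit in HIGHLIGHT.findall(snippet):
            hit = hit.strip()
            # Keep only highlights that are contiguous character
            # sequences of the query, as the method requires.
            if hit and hit in sub_sentence:
                segments.append(hit)
    return segments
```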


2.2 Segments Scoring

Each segment is scored so that we can select, as the final segmentation, the subset of segments which scores the highest while reconstructing the query. Various scoring methods can be used; here we try two of them, namely a frequency-based and a Support Vector Machine (SVM)-based method.

2.2.1 Frequency-based

This method uses term frequency as the scoring function, defined as the ratio of the number of occurrences of a segment to the total number of occurrences of all segments.
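Written out as a formula (our notation; the paper states the definition only in words), the score of a segment s drawn from the multiset S of all collected segments is

\[ \mathrm{score}(s) = \frac{\mathrm{freq}(s)}{\sum_{s' \in S} \mathrm{freq}(s')} \]

so the scores of all collected segments sum to one.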

2.2.2 SVM-based

This method uses an SVM classifier with an RBF kernel and maps its outputs into probabilities, which serve as the scores [2].
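As an illustration only (the paper does not name a toolkit), Platt's probability mapping [2] is available off the shelf, e.g. in scikit-learn, where probability=True fits a sigmoid on the SVM decision values; the toy features and labels below are invented stand-ins for the {TF, DF, LEN} features described in Section 3.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.random((40, 3))          # toy stand-ins for {TF, DF, LEN} features
y = (X[:, 0] > 0.5).astype(int)  # toy labels: 1 = valid word segment

# RBF kernel as in the paper; probability=True enables Platt scaling [2].
clf = SVC(kernel="rbf", probability=True)
clf.fit(X, y)

scores = clf.predict_proba(X)[:, 1]  # segment score = P(valid word)
```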

2.3 Segments Selecting

We call a subset "valid" if its member segments can exactly reconstruct the query; the score of a valid subset is the average score of its member segments. We select the highest-scoring valid subset as the final segmentation. For efficiency, we use greedy search rather than dynamic programming to find valid subsets.
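Below is a minimal sketch of the greedy selection, under our own simplifying assumption that the query is rebuilt left to right, always taking the highest-scoring collected segment that matches the remaining text; the paper does not spell out its exact search order or tie-breaking.

```python
def select_segmentation(query, scores):
    """Greedily reconstruct `query` from scored segments.

    scores maps each collected segment to its score. A valid subset
    must rebuild the query exactly; its score is the mean score of
    its member segments.
    """
    if not query:
        return [], 0.0
    chosen, pos = [], 0
    while pos < len(query):
        candidates = [s for s in scores if s and query.startswith(s, pos)]
        if not candidates:            # dead end: fall back to one character
            candidates = [query[pos]]
        best = max(candidates, key=lambda s: scores.get(s, 0.0))
        chosen.append(best)
        pos += len(best)
    return chosen, sum(scores.get(s, 0.0) for s in chosen) / len(chosen)
```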

Figure 1. Yahoo! search result of “ ” (He said happily). Red words are the segments.

3. EVALUATIONS

We evaluate our method on the benchmark MSR dataset provided by the SIGHAN'05 workshop (www.sighan.org/bakeoff2005/) and also compare it with the IBM Full-parser, a state-of-the-art dictionary-based method adopting the maximum matching strategy.

The training data consists of 3,000 randomly selected sentences (note that with the frequency-based scoring function, our method needs no training and performs unsupervised segmentation), and the entire test set (about 4,500 sentences) is used for testing. The feature space is three-dimensional: {TF, DF, LEN}. TF is defined as in Section 2.2.1, DF is the number of documents indexed by a segment, and LEN is the number of characters in a segment.
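For concreteness, a sketch of assembling one such feature vector; the doc_frequency lookup is hypothetical, standing in for whatever index or hit-count statistic the search engine exposes.

```python
def feature_vector(segment, tf, doc_frequency):
    """Build the three-dimensional feature vector {TF, DF, LEN}.

    tf:            term frequency, as defined in Section 2.2.1
    doc_frequency: hypothetical callable giving the number of
                   documents indexed by the segment
    """
    return [tf, doc_frequency(segment), len(segment)]
```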

3.1 Evaluation on SIGHAN’05 Benchmark Data

Figure 2 shows the performance of our approach as output by the SIGHAN'05 benchmark evaluation. The dotted and blocked columns correspond to the frequency-based and SVM-based approaches, respectively. Although the scores are lower than those reported at SIGHAN'05, the approach is effective given that we used only 3,000 training sentences (in the case of the SVM-based method) while the SIGHAN'05 groups used about 86,000. Moreover, our method avoids the OOV problem. Interestingly, the frequency-based method outperforms the SVM-based method in precision and F-measure. A possible reason is that the feature space is too simple to fully describe the data, so the power of the SVM model could not be fully exploited. We argue that better performance can be achieved if more search results are provided: currently only Google search is used, and it returns only about 800 snippets, whose highlighted character sequences (i.e., segments) are generally long and contain multiple semantic concepts owing to Google's powerful retrieval, which limits the effectiveness of the extracted segments. In fact, based on a rough evaluation, much better performance can be achieved if we combine the search results of Yahoo! and Google. However, since Yahoo! prohibits frequent queries (to prevent DDoS attacks), we were not able to collect enough training data from Yahoo!; still, this suggests that with a local search engine and a large document set, much better performance can be expected.

Figure 2. Evaluation on SIGHAN'05 data with the two different segment scoring methods.

3.2 Comparison to IBM Full-parser (FP)

Figure 3 gives examples comparing our method with the IBM Full-Parser (FP), illustrating four cases in which our method is superior to dictionary-based methods. The correct segmentation is boldfaced; character sequences quoted in “<>” show the wrong output of IBM Full-parser, and those quoted in “[]” show the correct output of our method.

Figure 3. Examples of the superiority to IBM Full-Parser.

The first two examples contain a location name “ ” (Zhimao Bay) and a newly proposed Chinese social concept “ ” (Eight Honors and Eight Disgraces), neither of which is included in FP's dictionary, so FP separates the two proper nouns into independent characters. Example (3) contains an idiom which includes a phrase, “ ” (tribulation), that happens to be an entry in FP's dictionary; hence FP separates the idiom into three words. Example (4) shows an ambiguous query “ ”: it can be parsed either as “[ (monk)] [ (has not)]” or as “[ (and)] [ (not yet)] [ (have)]” when no context information (here, “ ” (technical title)) is given. Since FP adopts the maximum matching strategy and “ ” (monk) is an entry in its dictionary, it takes the former segmentation. In contrast, by leveraging document information and search technology, our method takes the context “ ” (technical title) into consideration, which directs us to select the latter, correct segmentation, as monks never hold technical titles.

4. CONCLUSIONS

Chinese word segmentation is a widely required step in Chinese information processing. In this paper, we propose a novel solution which leverages Web data and search technology. It contains three steps: 1) collecting segments from search results, 2) scoring the segments, and 3) ranking candidate segmentations. The method is good at discovering new words (no OOV problem), adapts to different segmentation standards, and can be entirely unsupervised, which saves the labor of labeling training data. There is much possible future work, such as finding more effective scoring methods and combining the current approach with other types of segmentation methods for better performance.

5. REFERENCES

[1] Gao, J.F., Li, M., Wu, A., and Huang, C.N. Chinese Word Segmentation and Named Entity Recognition: A Pragmatic Approach. Computational Linguistics, MIT Press, 2005.

[2] Platt, J. Probabilistic Outputs for Support Vector Machines and Comparisons to Regularized Likelihood Methods. In Advances in Large Margin Classifiers, MIT Press, 1999.

[3] Sproat, R., and Shih, C. Corpus-based Methods in Chinese Morphology and Phonology. COLING, 2002.

[4] Xue, N.W. Chinese Word Segmentation as Character Tagging. Computational Linguistics and Chinese Language Processing, Vol. 8, No. 1, Feb. 2003, pp. 29–48.
