A Search-based Chinese Word Segmentation Method
Xin-Jing Wang, Wen Liu, Yong Qin (IBM China Research Center)

Key Idea:
1. Leverage Web documents and search engines to suggest word segments
2. Score the segments
3. Rank the sequences: apply greedy search based on the scores and select the top-ranked sequence as the output segmentation result

Input: Chinese sentence s, e.g. “他高兴地说:好的” (He said happily: OK)

Evaluation on SIGHAN’05 MSR data:

sentence: “他高兴地说:好的”

Query: “他高兴地说”

Query: “好的”

Segment s into a set of clauses {si} at punctuation marks, e.g. {si} = {“他高兴地说”, “好的”}
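The clause-splitting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the delimiter set `CLAUSE_DELIMITERS` and the function name `split_into_clauses` are assumptions, since the poster does not list which punctuation marks it splits on.

```python
import re

# Assumed delimiter set (full-width and ASCII punctuation); the poster does not specify it.
CLAUSE_DELIMITERS = ",。!?:;、,.!?:;"


def split_into_clauses(sentence: str) -> list[str]:
    """Split a sentence into non-empty clauses at runs of punctuation."""
    pattern = "[" + re.escape(CLAUSE_DELIMITERS) + "]+"
    return [clause for clause in re.split(pattern, sentence) if clause.strip()]


clauses = split_into_clauses("他高兴地说:好的")
print(clauses)  # ['他高兴地说', '好的']
```

Each resulting clause would then be submitted to the search engine as a separate query.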

Theoretical Justification:

$$w^* = \arg\max_{w \in \mathrm{GEN}(s)} P(w \mid s) = \arg\max_{w \in \mathrm{GEN}(s)} \underbrace{P(w)}_{(a)} \, \underbrace{P(s \mid w)}_{(b)}$$

Submit si to a search engine and extract the highlighted pieces of the snippets {wi}, e.g. {wi} = {“他”, “他高兴”, “高兴地”, “高兴”, “地”, “说”, “地说”}
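A rough sketch of collecting highlighted pieces from result snippets is given below. It assumes the snippets arrive as HTML strings in which the search engine wraps matched text in `<em>` tags; that markup, the function name `extract_highlights`, and the sample snippets are all made up for illustration and are not taken from the poster.

```python
import re


def extract_highlights(snippets: list[str], clause: str) -> set[str]:
    """Collect highlighted pieces that occur as substrings of the queried clause."""
    pieces = set()
    for snippet in snippets:
        # Assumption: highlighted text is wrapped in <em>...</em>.
        for piece in re.findall(r"<em>(.*?)</em>", snippet):
            if piece and piece in clause:
                pieces.add(piece)
    return pieces


# Made-up snippets, for illustration only.
snippets = [
    "<em>他</em>今天很<em>高兴</em>",
    "她<em>高兴地</em>笑了",
    "老师认真<em>地说</em>",
]
candidates = extract_highlights(snippets, "他高兴地说")
print(sorted(candidates))
```

Keeping only pieces that are substrings of the clause is a design choice of this sketch, so that every candidate can take part in a segmentation of that clause.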

1. Submit s to a search engine and obtain {wi}, i.e. GEN(s)
2. P(w) ~ Scoring: assuming the wi are i.i.d., $P(w) = \prod_i P(w_i)$, where P(wi) is given by term frequency
3. P(s | w) ~ Ranking: how likely (at which rank) a sequence of n-grams generates the original sentence
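The i.i.d. scoring in step 2 can be illustrated with a small sketch. It assumes P(wi) is the term frequency of wi normalized over the candidate set, which is one plausible reading of "term frequency" rather than the poster's stated formula; the counts and function names are made up for illustration.

```python
from math import prod


def segment_probabilities(term_freq: dict[str, int]) -> dict[str, float]:
    """Estimate P(w_i) as term frequency normalized over all candidates (assumed)."""
    total = sum(term_freq.values())
    return {segment: count / total for segment, count in term_freq.items()}


def sequence_probability(sequence: list[str], probs: dict[str, float]) -> float:
    """P(w) as the product of P(w_i) under the i.i.d. assumption."""
    return prod(probs.get(segment, 0.0) for segment in sequence)


# Made-up term frequencies, for illustration only.
term_freq = {"他": 50, "高兴地": 20, "高兴": 30, "地": 40, "说": 60}
probs = segment_probabilities(term_freq)
p_a = sequence_probability(["他", "高兴地", "说"], probs)
p_b = sequence_probability(["他", "高兴", "地", "说"], probs)
print(p_a > p_b)  # True
```

With these invented counts the three-segment sequence scores higher than the four-segment one, because each extra factor below 1 lowers the product.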

Typical Advantages:
1. Free from the Out-of-Vocabulary (OOV) problem, a typical benefit of leveraging Web documents.
2. Adaptive to different Chinese word segmentation standards (e.g. MSR, PKU, AS), since ideally all valid character sequences can be obtained by searching the Web.
3. Can be entirely unsupervised, needing no training corpora.

1. Training data: 3,000 sentences vs. more than 86,000 for SIGHAN’05; testing data: 4,500 sentences (the whole dataset)
2. Feature set: {term freq., doc freq., length}
3. Conclusions:
   1) Our performance approaches that of the SIGHAN’05 winner, with much less training data (SVM-based scoring) or none (Freq.-based scoring)
   2) The Freq.-based method does better than the SVM-based one, possibly because the features are too simple
   3) The results are biased by the Google search engine, and there is much room for improvement
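As a rough illustration of the feature set in item 2, the sketch below computes the three features for one candidate segment. The poster only names the features; the definitions used here (term frequency as total occurrences across snippets, document frequency as the number of snippets containing the segment, length in characters) are assumptions, as are the function name and the sample snippets.

```python
def segment_features(segment: str, snippets: list[str]) -> tuple[int, int, int]:
    """Return (term frequency, document frequency, length) for a candidate segment."""
    # Assumed definitions; the poster does not spell them out.
    term_freq = sum(snippet.count(segment) for snippet in snippets)
    doc_freq = sum(1 for snippet in snippets if segment in snippet)
    return term_freq, doc_freq, len(segment)


# Made-up plain-text snippets, for illustration only.
sample_snippets = ["他高兴地说高兴", "高兴地笑", "他说"]
features = segment_features("高兴", sample_snippets)
print(features)  # (3, 2, 2)
```

A vector like this could be fed to a classifier for the SVM-based scoring, or the first component used alone for the Freq.-based scoring.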

Scoring wi

Select all valid sequences {wi} and rank them, e.g. {wi} = {[他][高兴地][说], [他][高兴][地][说], [他高兴][地][said], …}

Output the top ranking w* e.g. w* = [他][高兴地][说]
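The select-and-rank step can be sketched as below. Note the swap: the poster applies greedy search, whereas this sketch enumerates every valid sequence exhaustively, which is only practical for short clauses. The per-segment scores are made up, and combining them by product is an assumption chosen to match the i.i.d. factorization of P(w); all function names are hypothetical.

```python
from math import prod
from typing import Callable


def enumerate_segmentations(clause: str, candidates: set[str]) -> list[list[str]]:
    """Return every way to cover the clause exactly with candidate segments."""
    if not clause:
        return [[]]
    results = []
    for end in range(1, len(clause) + 1):
        head = clause[:end]
        if head in candidates:
            for tail in enumerate_segmentations(clause[end:], candidates):
                results.append([head] + tail)
    return results


def rank_segmentations(
    clause: str,
    candidates: set[str],
    score_sequence: Callable[[list[str]], float],
) -> list[list[str]]:
    """Sort all valid sequences from highest to lowest score."""
    sequences = enumerate_segmentations(clause, candidates)
    return sorted(sequences, key=score_sequence, reverse=True)


# Made-up segment scores, for illustration only.
scores = {"他": 0.9, "他高兴": 0.1, "高兴地": 0.8, "高兴": 0.6,
          "地": 0.5, "说": 0.9, "地说": 0.2}
ranked = rank_segmentations(
    "他高兴地说", set(scores), lambda seq: prod(scores[seg] for seg in seq)
)
print(ranked[0])  # ['他', '高兴地', '说']
```

With these invented scores the top-ranked sequence matches the poster's example output; real scores would come from the Freq.-based or SVM-based scoring.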

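Glosses for the candidate segments: “他” (he), “他高兴地” (he happily), “说” (say), “高兴” (happy), “地说” (-ly said), “高兴地” (happily)

Scoring each segment $w^i_j$ (the j-th segment of candidate sequence $w^i$), either:
Freq-based: $S_{tf}(w^i_j)$, computed from the term frequencies $TF(w^i_j)$ with sums over $k = 1, \dots, N_i$ (exact expression not recoverable here)
Or
SVM-based

Ranking: $$w^* = \arg\max_{w^i} S(w^i)$$ where $S(w^i)$ aggregates the segment scores $S(w^i_j)$ over j

Isempty({si})? Y: End. N: process the next clause si.

Output: [他] [高兴地] [说] ([he] [happily] [said])

Comparison to IBM FP (dictionary-based):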

(1) 我明天要去[止锚湾]玩
    FP:   我 明天 要 去 <止 锚 湾> 玩
    Ours: 我 明天 要 去 [止锚湾] 玩

(2) 胡锦涛说[八荣八耻]很重要
    FP:   胡锦涛 说 <八 荣 八 耻> 很重要
    Ours: 胡锦涛 说 [八荣八耻] 很重要

(3) 老百姓[有苦难言]
    FP:   老百姓 <有 苦难 言>
    Ours: 老百姓 [有苦难言]

(4) 有职称的和[尚未]有职称的
    FP:   有 职称 的 <和尚 未有> 职称 的
    Ours: 有 职称 的 和 [尚未] 有 职称 的

(1) 止锚湾 (Zhi Mao Bay): a location name
(2) 八荣八耻 (Eight-Honors-and-Eight-Disgraces): a newly coined social term
(3) 有苦难言 (unable to tell the sufferings): an idiom that contains another word, “苦难” (tribulation)
(4) 尚未 (not yet): an ambiguous string; when preceded by “和” (and), the characters can be grouped as “和尚” (monk), as FP outputs
