A Search-based Chinese Word Segmentation Method
Xin-Jing Wang, Wen Liu, Yong Qin (IBM China Research Center)

Key Idea:
1. Leverage Web documents and search engines to suggest word segments.
2. Score the segments.
3. Rank the sequences: apply greedy search based on the scores and select the top-ranked sequence as the output segmentation result.

Input: Chinese sentence s, e.g. “他高兴地说:好的” (He said happily: OK)

Evaluation on SIGHAN’05 MSR data:

sentence: “他高兴地说:好的”

Query: “他高兴地说”

Query: “好的”

Split s at punctuation marks into a set of clauses {si}, e.g. {si} = {“他高兴地说”, “好的”}
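The clause-splitting step can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function name `split_clauses` and the delimiter set `CLAUSE_DELIMITERS` are assumptions, and the set of punctuation marks is deliberately small.

```python
import re

# Punctuation treated as clause boundaries (illustrative, not exhaustive;
# both full-width and ASCII forms are listed).
CLAUSE_DELIMITERS = r"[,。!?:;、,.!?:;]+"


def split_clauses(sentence: str) -> list[str]:
    """Split a sentence at punctuation and drop empty pieces."""
    return [c for c in re.split(CLAUSE_DELIMITERS, sentence) if c.strip()]


print(split_clauses("他高兴地说:好的"))  # ['他高兴地说', '好的']
```

Each resulting clause is then used as one query, as in the two "Query" boxes above.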

Theoretical Justification:

$$w^* = \arg\max_{w \in GEN(s)} P(w \mid s) = \arg\max_{w \in GEN(s)} \underbrace{P(w)}_{(a)} \, \underbrace{P(s \mid w)}_{(b)}$$

Submit each si to a search engine and extract the highlighted pieces of the returned snippets as candidate segments {wi}, e.g. {wi} = {“他”, “他高兴”, “高兴地”, “高兴”, “地”, “说”, “地说”}
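A rough sketch of the collection step is given below. It assumes the snippets arrive as HTML strings in which the engine wraps matched text in `<em>` tags; that markup, the helper name `collect_segments`, and the sample snippets are all made up for illustration, since real search engines differ and the poster does not specify the format.

```python
import re
from collections import Counter

# Assumed highlight markup; a real engine may use <b>, <strong>, or spans.
HIGHLIGHT = re.compile(r"<em>(.*?)</em>")


def collect_segments(snippets: list[str], clause: str) -> Counter:
    """Count highlighted pieces that occur inside the queried clause."""
    counts: Counter = Counter()
    for snippet in snippets:
        for piece in HIGHLIGHT.findall(snippet):
            if piece and piece in clause:
                counts[piece] += 1
    return counts


# Hypothetical snippets for the query "他高兴地说".
snippets = [
    "<em>他</em>今天很<em>高兴</em>",
    "她<em>高兴地</em>笑了",
    "<em>他高兴</em>极了",
    "老师<em>说</em>,他轻轻<em>地说</em>",
    "大家都很<em>高兴</em>,坐在<em>地</em>上",
]
counts = collect_segments(snippets, "他高兴地说")
print(sorted(counts))
```

The counts gathered here play the role of term frequencies for the scoring step.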

1. Submit s to a search engine and obtain {wi}, i.e. GEN(s).
2. P(w) ~ Scoring: assuming the segments are i.i.d., $P(w) = \prod_i P(w_i)$, where $P(w_i)$ is estimated from term frequency.
3. P(s | w) ~ Ranking: how likely (at which rank) a sequence of n-grams generates the original sentence.
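The i.i.d. assumption in step 2 can be illustrated with a short sketch. It assumes that each P(wi) is estimated as a relative term frequency over the candidate set, which is one plausible reading of "term frequency" rather than a detail confirmed by the poster; the counts in `term_freq` are toy values, and `log_prob_sequence` is a hypothetical helper name.

```python
import math


def log_prob_sequence(sequence: list[str], term_freq: dict[str, int]) -> float:
    """log P(w) = sum_i log P(w_i), with P(w_i) a relative term frequency."""
    total = sum(term_freq.values())
    log_p = 0.0
    for w in sequence:
        if term_freq.get(w, 0) == 0:
            return float("-inf")  # unseen segment: probability zero
        log_p += math.log(term_freq[w] / total)
    return log_p


# Toy counts (illustrative only; they sum to 200).
term_freq = {"他": 50, "高兴": 30, "高兴地": 10, "地": 40,
             "说": 60, "他高兴": 5, "地说": 5}

a = log_prob_sequence(["他", "高兴地", "说"], term_freq)
b = log_prob_sequence(["他", "高兴", "地", "说"], term_freq)
print(a > b)  # True with these toy counts
```

Working in log space avoids underflow when a sequence has many segments. Note that a plain product tends to favor sequences with fewer segments, since every factor is at most 1.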

Typical Advantages:
1. Free from the Out-of-Vocabulary (OOV) problem, a typical benefit of leveraging Web documents.
2. Adaptive to different Chinese word segmentation standards (e.g. MSR, PKU, AS), since ideally all valid character sequences can be obtained by searching the Web.
3. Can be entirely unsupervised, requiring no training corpora.

1. Training data: 3,000 sentences, vs. more than 86,000 sentences in SIGHAN’05; testing data: 4,500 (the whole dataset).
2. Feature set: {term frequency, document frequency, length}.
3. Conclusions:
   1) Our performance approaches that of the SIGHAN’05 winner, with much less training data (SVM-based scoring) or none (Freq.-based scoring).
   2) The Freq.-based method does better than the SVM-based one, possibly because the features are too simple.
   3) The results are biased by the Google search engine, and there is much room for improvement.

Scoring wi

Select all valid sequences {wi} and rank them, e.g. {wi} = {[他][高兴地][说], [他][高兴][地][说], [他高兴][地][说], ...}

Output the top-ranked w*, e.g. w* = [他][高兴地][说]
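Enumerating the valid sequences can be sketched as a small recursive generator. Here "valid" is taken to mean that the segments, concatenated in order, reproduce the clause exactly; this interpretation and the name `valid_sequences` are assumptions. The candidate set is the one from the example above.

```python
from typing import Iterator


def valid_sequences(clause: str, segments: set[str]) -> Iterator[list[str]]:
    """Yield every way to cover `clause` exactly with pieces from `segments`."""
    if not clause:
        yield []
        return
    for seg in sorted(segments):
        if seg and clause.startswith(seg):
            for rest in valid_sequences(clause[len(seg):], segments):
                yield [seg] + rest


candidates = {"他", "他高兴", "高兴地", "高兴", "地", "说", "地说"}
sequences = list(valid_sequences("他高兴地说", candidates))
for seq in sequences:
    print("".join(f"[{w}]" for w in seq))
```

For this clause the generator yields five sequences, including the three listed above. Exhaustive enumeration grows quickly with clause length, which is one reason a search strategy over the scores is useful.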

Comparison to IBM FP (dictionary-based):

Glosses: “他” (he), “他高兴地” (he happily), “说” (say), “高兴” (happy), “地说” (-ly said), “高兴地” (happily)

Freq-based scoring:

$$S(w_{ij}) = \frac{TF(w_{ij})}{\sum_{k=1}^{N_i} TF(w_{ik})}$$

$$w^* = \arg\max_{w_i} S(w_i) = \arg\max_{w_i} \sum_j S(w_{ij})$$
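The poster mentions greedy search over the segment scores without giving details. The sketch below shows one possible reading, a left-to-right pass that always takes the best-scoring candidate at the current position; it is not necessarily the authors' procedure, and it does not guarantee the true arg max. The names `greedy_segment` and `sequence_score` and the values in `scores` are illustrative assumptions.

```python
def sequence_score(sequence: list[str], scores: dict[str, float]) -> float:
    """S(w_i) = sum_j S(w_ij): a sequence scores the sum of its segments."""
    return sum(scores.get(w, 0.0) for w in sequence)


def greedy_segment(clause: str, scores: dict[str, float]) -> list[str]:
    """Left to right, pick the highest-scoring candidate that fits."""
    result, pos = [], 0
    while pos < len(clause):
        # Candidates that match the clause at the current position.
        here = [w for w in scores if w and clause.startswith(w, pos)]
        # Fall back to a single character when no candidate fits.
        best = max(here, key=scores.get) if here else clause[pos]
        result.append(best)
        pos += len(best)
    return result


# Toy segment scores (made up for illustration).
scores = {"他": 0.2, "高兴地": 0.4, "说": 0.2, "高兴": 0.1,
          "地": 0.05, "他高兴": 0.02, "地说": 0.03}

best = greedy_segment("他高兴地说", scores)
print(best)  # ['他', '高兴地', '说']
```

The single-character fallback keeps the loop from stalling when the search engine returned no candidate for some position.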


Is {si} empty? If yes:

Output: [他] [高兴地] [说] ([he] [happily] [said])

(1) 我明天要去[止锚湾]玩
    FP:  我 明天 要 去 <止 锚 湾> 玩
    Our: 我 明天 要 去 [止锚湾] 玩

(2) 胡锦涛说[八荣八耻]很重要
    FP:  胡锦涛 说 <八 荣 八 耻> 很重要
    Our: 胡锦涛 说 [八荣八耻] 很重要

(3) 老百姓[有苦难言]
    FP:  老百姓 <有 苦难 言>
    Our: 老百姓 [有苦难言]

(4) 有职称的和[尚未]有职称的
    FP:  有 职称 的 <和尚 未有> 职称 的
    Our: 有 职称 的 和 [尚未] 有 职称 的

(1) 止锚湾 (Zhi Mao Bay): a location name.
(2) 八荣八耻 (Eight Honors and Eight Disgraces): a newly proposed social concept.
(3) 有苦难言 (unable to tell the sufferings): an idiom that contains another word, “苦难” (tribulation).
(4) 尚未 (not yet): an ambiguous phrase; when preceded by “和” (and), the characters can be grouped as “和尚” (monk), as in the FP output.
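To see how a dictionary-based segmenter can arrive at the reading in example (4), here is a sketch of forward maximum matching, a generic dictionary-based baseline. The poster does not say that IBM FP works this way, so this is a swapped-in technique used purely for illustration; the function name and the small dictionary are assumptions.

```python
def forward_max_match(text: str, dictionary: set[str]) -> list[str]:
    """At each position take the longest dictionary word; else one character."""
    max_len = max(len(w) for w in dictionary)
    out, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted so the loop advances.
            if size == 1 or piece in dictionary:
                out.append(piece)
                i += size
                break
    return out


# Toy dictionary (illustrative only).
toy_dict = {"有", "职称", "的", "和", "和尚", "尚未", "未有"}
print(" ".join(forward_max_match("有职称的和尚未有职称的", toy_dict)))
# 有 职称 的 和尚 未有 职称 的
```

Because “和尚” is longer than “和”, the longest-match rule commits to it and never considers “尚未”, which reproduces the kind of boundary error shown for FP in example (4).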
