A Search-based Chinese Word Segmentation Method


Xin-Jing Wang, Wen Liu, Yong Qin (IBM China Research Center)

Key Idea:
1. Leveraging Web documents and search engines to suggest word segments
2. Scoring the segments
3. Ranking the sequences: applying greedy search based on the scores and selecting the top-ranked sequence as the output segmentation result

Input: Chinese sentence s, e.g. “他高兴地说：好的” (He said happily: OK)

Evaluation on SIGHAN’05 MSR data:

sentence: “他高兴地说：好的”

Query: “他高兴地说”

Query: “好的”

Segment s at punctuation marks to give a set of clauses {si}, e.g. {si} = {“他高兴地说”, “好的”}
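The clause-splitting step can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `split_clauses` and the set of punctuation marks in `CLAUSE_DELIMS` are assumptions, since the poster does not list which marks it splits on.

```python
import re
from typing import List

# Punctuation treated as clause boundaries. This set is an assumption;
# the poster does not specify the exact marks.
CLAUSE_DELIMS = "，。！？：；、,.!?:;"

def split_clauses(sentence: str) -> List[str]:
    """Split a sentence into non-empty clauses at punctuation marks."""
    pattern = "[" + re.escape(CLAUSE_DELIMS) + "]"
    return [clause for clause in re.split(pattern, sentence) if clause.strip()]

clauses = split_clauses("他高兴地说：好的")
print(clauses)
```

Each resulting clause would then be submitted to the search engine as a separate query.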

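Theoretical Justification:

$$w^* = \arg\max_{w \in \mathrm{GEN}(s)} P(w \mid s) = \arg\max_{w \in \mathrm{GEN}(s)} \underbrace{P(w)}_{(a)} \, \underbrace{P(s \mid w)}_{(b)}$$

The second equality follows from Bayes' rule, because P(s) does not depend on w.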

Submit each si to a search engine and extract the highlighted pieces of the returned snippets, {wi}, e.g. {wi} = {“他”, “他高兴”, “高兴地”, “高兴”, “地”, “说”, “地说”}
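A rough sketch of collecting highlighted pieces from snippets is given below. It assumes the search engine marks query hits with `<em>` tags in its snippet HTML, which is a hypothetical format chosen for illustration. The snippets themselves are invented examples, not real search results, and `highlighted_pieces` is an illustrative name.

```python
import re
from collections import Counter

# Assumed highlight markup; real search engines may use other tags.
HIGHLIGHT = re.compile(r"<em>(.*?)</em>")

def highlighted_pieces(snippets):
    """Count how often each highlighted piece occurs across the snippets."""
    counts = Counter()
    for snippet in snippets:
        for piece in HIGHLIGHT.findall(snippet):
            piece = piece.strip()
            if piece:
                counts[piece] += 1
    return counts

# Invented snippets for the query "他高兴地说" (illustration only).
snippets = [
    "<em>他</em>很<em>高兴</em>",
    "她<em>高兴地</em>笑了",
    "<em>他</em><em>高兴</em><em>地说</em>",
]
counts = highlighted_pieces(snippets)
print(counts)
```

The keys of `counts` play the role of the candidate set {wi}, and the counts can serve as raw term frequencies for scoring.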

1. Submit s to a search engine and obtain {wi}, i.e. GEN(s).
2. P(w) ~ Scoring: assuming the segments are i.i.d., $P(w) = \prod_i P(w_i)$, where $P(w_i)$ is given by term frequency.
3. P(s | w) ~ Ranking: how likely (at which rank) a sequence of n-grams is to generate the original sentence.

Typical Advantages:
1. Free from the out-of-vocabulary (OOV) problem, a typical benefit of leveraging Web documents.
2. Adaptive to different Chinese word segmentation standards (e.g. MSR, PKU, AS), since ideally all valid character sequences can be obtained by searching the Web.
3. Can be entirely unsupervised, requiring no training corpora.

1. Training data: 3,000 sentences vs. more than 86,000 for SIGHAN’05. Testing data: 4,500 (the whole dataset).
2. Feature set: {term frequency, document frequency, length}
3. Conclusions:
   1) Our performance approaches that of the SIGHAN’05 winner with much less training data (SVM-based scoring) or none (frequency-based scoring).
   2) The frequency-based method does better than the SVM-based one, possibly because the features are too simple.
   3) The results are biased by the Google search engine, and there is much room for improvement.

Scoring wi

Selecting all valid sequences {wi} and ranking them, e.g. {wi} = {[他][高兴地][说], [他][高兴][地][说], [他高兴][地][说], …}
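One way to enumerate the valid sequences is a simple recursion that keeps every split whose pieces all belong to the candidate set and together spell out the clause. The sketch below uses the candidate set from the example above. The function name `valid_sequences` is illustrative, and the exhaustive enumeration is an assumption about what "valid" means here, not necessarily the authors' procedure.

```python
from typing import List, Set

def valid_sequences(clause: str, candidates: Set[str]) -> List[List[str]]:
    """Return every way to cut `clause` into pieces drawn from `candidates`."""
    if not clause:
        return [[]]  # one valid way to segment the empty string
    results = []
    for end in range(1, len(clause) + 1):
        head = clause[:end]
        if head in candidates:
            # Extend each segmentation of the remainder with this head.
            for tail in valid_sequences(clause[end:], candidates):
                results.append([head] + tail)
    return results

candidates = {"他", "他高兴", "高兴地", "高兴", "地", "说", "地说"}
sequences = valid_sequences("他高兴地说", candidates)
for seq in sequences:
    print("".join("[" + w + "]" for w in seq))
```

For long clauses the number of sequences can grow quickly, which is presumably why the poster ranks them with a greedy search rather than scoring every one.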

Output the top-ranked w*, e.g. w* = [他][高兴地][说]

Comparison to IBM FP (dictionary-based):

“他” (he), “他高兴地” (he happily), “说” (say), “高兴” (happy), “地说” (-ly said), “高兴地” (happily)

Scoring is either frequency-based:

$$S_{tf}(w_j^i) = \frac{TF(w_j^i)}{\sum_{k=1}^{N_i} TF(w_k^i)}$$

or SVM-based.

$$w^* = \arg\max_{w^i} S(w^i) = \arg\max_{w^i} \sum_j S(w_j^i)$$
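The scoring and selection step can be sketched as below, under two assumptions: the frequency-based score of a segment is its term frequency normalized over all candidate segments, and the score of a sequence is the sum of its segments' scores. The sketch takes an exhaustive maximum over a short list of sequences instead of the greedy search the poster mentions. The term frequencies are invented toy numbers, not measured values, and all function names are illustrative.

```python
def tf_scores(term_freq):
    """Normalize raw term frequencies so that the scores sum to 1."""
    total = sum(term_freq.values())
    return {w: tf / total for w, tf in term_freq.items()}

def sequence_score(sequence, scores):
    """Score a sequence as the sum of its segments' scores."""
    return sum(scores.get(w, 0.0) for w in sequence)

def best_sequence(sequences, scores):
    """Pick the highest-scoring sequence (exhaustive, not greedy)."""
    return max(sequences, key=lambda seq: sequence_score(seq, scores))

# Toy term frequencies, invented for illustration only.
term_freq = {"他": 5, "高兴地": 4, "说": 5, "高兴": 2,
             "地": 1, "他高兴": 1, "地说": 2}
scores = tf_scores(term_freq)

sequences = [
    ["他", "高兴地", "说"],
    ["他", "高兴", "地", "说"],
    ["他高兴", "地", "说"],
]
best = best_sequence(sequences, scores)
print(best, sequence_score(best, scores))
```

With these toy numbers the three sequences score 14/20, 13/20 and 7/20, so the first is selected. Different frequencies would change the ranking.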

[Flowchart: Isempty({si})? Y: End. N: process the next clause si.]

Output: [他] [高兴地] [说] [he] [happily] [said]

(1) 我明天要去[止锚湾]玩 FP: 我明天要去 <止锚湾> 玩 Our: 我明天要去 [止锚湾] 玩

(2)胡锦涛说[八荣八耻]很重要 FP: 胡锦涛说 <八荣八耻> 很重要 Our: 胡锦涛说 [八荣八耻] 很重要

(3) 老百姓[有苦难言] FP: 老百姓 <有苦难言> Our: 老百姓 [有苦难言]

(4) 有职称的和[尚未]有职称的 FP: 有职称的 <和尚未有> 职称的 Our: 有职称的和 [尚未] 有职称的

(1) 止锚湾 (Zhi Mao Bay): a location name
(2) 八荣八耻 (Eight-Honors-and-Eight-Disgraces): a newly proposed social concept
(3) 有苦难言 (unable to tell the sufferings): an idiom that contains another word, “苦难” (tribulation)
(4) 尚未 (not yet): an ambiguous phrase; when preceded by “和” (and), it can be read as “和尚” (monk), as in the FP output