
Vietnamese POS Tagging for Social Media Text

Ngo Xuan Bach+*, Nguyen Dieu Linh+, Tu Minh Phuong+*
+ Posts and Telecommunications Institute of Technology, Vietnam
* FPT Software Research Lab, Vietnam

ICONIP 2016, Kyoto, Japan, October 2016

Part-of-Speech (POS) Tagging

The process of assigning to each word in a sentence the proper POS tag in the context in which it appears
  o Input: Book that flight .
  o Output: Book/VB that/DT flight/NN ./.

A fundamental task in natural language processing (NLP)
  o Provides useful information for many other NLP tasks
    ▪ Word sense disambiguation, text chunking, named entity recognition, syntactic parsing, semantic role labeling, semantic parsing, and so on

Challenges
  o How to find POS tags of new words and how to disambiguate multi-sense words
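The word/TAG output format shown in the example can be read back into (word, tag) pairs. The following is a minimal sketch only; the function name parse_tagged is a hypothetical helper, not part of any tagger mentioned in the talk.

```python
def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so './.' becomes ('.', '.')
        # and a word that itself contains '/' keeps its inner slash.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


pairs = parse_tagged("Book/VB that/DT flight/NN ./.")
```

Splitting on the last slash rather than the first is a design choice that keeps tokens such as dates or URLs intact on the word side.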

POS Tagging

Has been studied intensively for several decades
  o English (Brill, 1995; Ratnaparkhi, 1996; Toutanova et al., 2003)
  o Japanese (Nakagawa et al., 2002; Nakagawa and Uchimoto, 2007)
  o Arabic (Aldarmaki and Diab, 2015)
  o Vietnamese (Nghiem et al., 2008; Tran et al., 2009; Bach et al., 2013)

State-of-the-art POS taggers are statistical or machine-learning-based models trained on annotated corpora of conventional text
  o Penn Treebank for English
  o Kyoto corpus for Japanese
  o Viet Treebank for Vietnamese

Social Media Text

Web 2.0 platforms such as blogs, forums, wikis, and social networks have facilitated the generation of a huge volume of user-generated text
  o Has become an important source for both the data mining and NLP communities
  o Requires appropriate tools for text analysis

POS Tagging for Social Media Text

Social media text poses several challenges

Facebook Sentences            | Expected Sentences               | Translated Sentences               | Problems
Em đọc đc ấy mà a             | Em đọc được ấy mà anh            | I can read it                      | abbreviation
Nó good vậyyyy                | Nó giỏi vậy                      | He is so good                      | foreign word, typo
Ng đàn ông mặc áo đen cơ. :)) | Người đàn ông mặc áo đen cơ. :)) | Must be the guy with a black shirt | abbreviation, emoticon
Toi thich cái màu trắng       | Tôi thích cái màu trắng          | I like the white one               | word without tone mark

A POS tagger developed for conventional, edited text would perform poorly on such noisy data

This Work

We develop a Vietnamese tagging model with various types of linguistic features and a new POS tagset
  o We empirically show the effectiveness of the method on data from Vietnamese Facebook

We construct an annotated corpus for Vietnamese POS tagging
  o Consisting of 4150 sentences collected from Facebook

Both the annotated corpus and the trained POS tagger are made available to the research community

Outline

Introduction
Tagging Method
Experiments
Summary

A POS Tagset for Social Media Text

We extended the conventional POS tagset to cover the variations of social media text

Annotation Procedure

We extracted the textual content of posts and comments from Vietnamese Facebook

Raw text -> Preprocessing -> Automatic tagging -> Manual tagging -> Annotated corpus
  o Preprocessing: split sentences, remove noisy sentences
  o Automatic tagging: tag sentences using vnTagger
  o Manual tagging: two annotators manually correct the sentences

We used Cohen’s kappa coefficient to measure the inter-annotator agreement
  o The score was 0.84, which can be interpreted as almost perfect agreement
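Cohen's kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). The sketch below shows that arithmetic on a made-up toy example; the tag sequences are illustrative and are not taken from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same tag by both.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of the two annotators' tag frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    p_e = sum(count_a[t] * count_b[t] for t in count_a) / (n * n)
    if p_e == 1.0:
        return 1.0  # both annotators used one and the same tag everywhere
    return (p_o - p_e) / (1.0 - p_e)


# Toy data (invented): the annotators disagree on one token out of six.
annotator_1 = ["N", "V", "N", "A", "V", "N"]
annotator_2 = ["N", "V", "A", "A", "V", "N"]
kappa = cohen_kappa(annotator_1, annotator_2)  # p_o = 5/6, p_e = 1/3
```

In this toy case kappa works out to 0.75, lower than the raw agreement of 5/6 because part of that agreement is expected by chance.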

Statistical Information of the Corpus

#sentences | #tokens | #tokens/sentence | #unique words
4150       | 38498   | 9.3              | 6416

POS tag distribution (%), tags in chart order

POS tag | %
V       | 19.2
N       | 18.46
PUN     | 8.86
R       | 8.04
A       | 6.68
P       | 5.89
AB      | 5.23
E       | 3.8
Np      | 3.54
T       | 3.52
C       | 3.17
M       | 2.41
HP      | 2.36
X       | 1.75
I       | 1.55
FL      | 1.31
Nc      | 1.15
L       | 0.93
CF      | 0.79
CC      | 0.49
Nu      | 0.32
SD      | 0.3
IL      | 0.21
AR      | 0.05
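Statistics of this kind can be derived directly from a tagged corpus; for instance, 38498 tokens over 4150 sentences gives about 9.28 tokens per sentence, which rounds to the reported 9.3. The sketch below is illustrative only: corpus_stats is a hypothetical helper, the toy corpus is invented, and counting unique words case-sensitively is an assumption.

```python
from collections import Counter


def corpus_stats(sentences):
    """sentences: list of sentences, each a list of (word, tag) pairs."""
    words = [w for sent in sentences for w, _ in sent]
    tag_counts = Counter(t for sent in sentences for _, t in sent)
    n_sent, n_tok = len(sentences), len(words)
    return {
        "sentences": n_sent,
        "tokens": n_tok,
        # Assumption: words are compared case-sensitively.
        "unique_words": len(set(words)),
        "tokens_per_sentence": n_tok / n_sent if n_sent else 0.0,
        # Share of each tag among all tokens, in percent.
        "tag_percent": {t: 100.0 * c / n_tok for t, c in tag_counts.items()},
    }


# Invented toy corpus with placeholder words.
toy = [
    [("a", "N"), ("b", "V")],
    [("a", "N"), ("c", "A"), ("d", "N"), (".", "PUN")],
]
stats = corpus_stats(toy)
```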

Tagging Model

Corpus -> Feature Extraction -> Conditional Random Fields -> Tagging Model

Feature Type      | Description
Basic features    | unigrams, bigrams, and trigrams of words
Enhanced features | special character, icon or emoticon, digits, capitalization, hashtags and URLs
METAPH feature    | the Metaphone algorithm is used to create a coarse phonetic normalization of words to simpler keys
GENTAG features   | the output (the predicted POS tags) of vnTagger: unigrams, bigrams, and trigrams of POS tags
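The feature table can be made concrete with a small per-token feature function. This is a sketch under several assumptions, not the authors' implementation: the context window (one word to each side), the feature names, the emoticon list, and the URL pattern are all hypothetical, and the METAPH feature is left out because the slides do not say how Metaphone was applied. The gen_tags argument stands for tags predicted beforehand by a general-purpose tagger such as vnTagger.

```python
import re
import unicodedata

# Hypothetical URL pattern and emoticon list; the originals are not given.
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)
EMOTICONS = {":)", ":))", ":(", ":D", ";)", ":P"}
PAD = "<PAD>"


def _at(seq, j):
    """Item at position j, or a padding symbol outside the sentence."""
    return seq[j] if 0 <= j < len(seq) else PAD


def token_features(words, gen_tags, i):
    """Feature dict for token i, in the dict-per-token style CRF toolkits accept."""
    # NFC keeps Vietnamese letters with tone marks as single code points.
    word = unicodedata.normalize("NFC", words[i])
    prev_w, next_w = _at(words, i - 1), _at(words, i + 1)
    tag = gen_tags[i]
    prev_t, next_t = _at(gen_tags, i - 1), _at(gen_tags, i + 1)
    return {
        # Basic features: word unigram, bigrams, trigram
        "w0": word,
        "w-1|w0": prev_w + "|" + word,
        "w0|w+1": word + "|" + next_w,
        "w-1|w0|w+1": "|".join((prev_w, word, next_w)),
        # Enhanced features
        "is_emoticon": word in EMOTICONS,
        "has_digit": any(c.isdigit() for c in word),
        "is_capitalized": word[:1].isupper(),
        "is_hashtag": len(word) > 1 and word.startswith("#"),
        "is_url": URL_RE.match(word) is not None,
        "has_special_char": any(not c.isalnum() for c in word),
        # GENTAG features: unigram, bigrams, trigram of pre-predicted tags
        "t0": tag,
        "t-1|t0": prev_t + "|" + tag,
        "t0|t+1": tag + "|" + next_t,
        "t-1|t0|t+1": "|".join((prev_t, tag, next_t)),
    }
```

A sentence is then represented as the list of these dicts, one per token, paired with the gold tag sequence for CRF training.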

Experiments


Experimental Setup

Conducted 10-fold cross-validation
Measured the performance by accuracy, precision, recall, and the F1 score

Model      | Characteristics
Baseline 1 | used the output of vnTagger
Baseline 2 | used a list of icons to automatically correct the output of vnTagger
CRF1       | used CRFs with basic features
CRF2       | CRF1 + enhanced features
CRF3       | CRF2 + METAPH features
CRF4       | CRF3 + features from Baseline 1
CRF5       | CRF3 + features from Baseline 2
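Baseline 2 post-corrects a general-purpose tagger's output with an icon list. One way this could look is sketched below; the icon list, the tag name ICON, and the function correct_icons are all hypothetical, since the slides give neither the list nor the tag assigned to icons.

```python
# Hypothetical icon list and tag name; neither is specified in the slides.
ICONS = frozenset({":)", ":))", ":(", ":D", ";)", "<3"})
ICON_TAG = "ICON"


def correct_icons(tagged, icons=ICONS, icon_tag=ICON_TAG):
    """Overwrite the tag of every token that appears in the icon list.

    tagged: list of (word, tag) pairs produced by a first-pass tagger.
    """
    return [(w, icon_tag if w in icons else t) for w, t in tagged]


# First-pass tags here are placeholders, not real vnTagger output.
first_pass = [("ok", "T1"), (".", "T2"), (":))", "T2")]
corrected = correct_icons(first_pass)
```

Only tokens that exactly match a list entry are changed, so every other first-pass tag is passed through untouched.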

Tagging Accuracy

[Bar chart: tagging accuracy (%) of Baseline 1, Baseline 2, CRF1, CRF2, CRF3, CRF4, and CRF5; y-axis from 70 to 90. Values labeled on the bars, lowest to highest: 76.99, 80.69, 81.39, 83.53, 83.86, 88.16, 88.26]
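The evaluation protocol (10-fold cross-validation scored by accuracy, precision, recall, and F1) can be sketched as follows. This is an illustration under assumptions: the fold assignment (every k-th item) and the per-tag reporting are choices made here, and the slides do not say how folds were drawn or how per-tag scores were averaged.

```python
from collections import Counter


def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for each of k folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, folds[i]


def evaluate(gold, pred):
    """Token accuracy plus per-tag (precision, recall, F1)."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("need two non-empty tag sequences of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p was wrong
            fn[g] += 1  # gold tag g was missed
    accuracy = sum(tp.values()) / len(gold)
    per_tag = {}
    for tag in set(gold) | set(pred):
        prec = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_tag[tag] = (prec, rec, f1)
    return accuracy, per_tag


# Toy example (invented tags): one of four tokens is mis-tagged.
accuracy, per_tag = evaluate(["N", "V", "N", "A"], ["N", "V", "A", "A"])
```

In a full run, a model would be trained on each train split, scored on the matching test split, and the ten scores averaged.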

Tagging Results in Detail


Confused Tags



Summary

We have developed a Vietnamese part-of-speech tagging system for social media text
  o Constructed an annotated corpus from Facebook
  o Outperformed a state-of-the-art Vietnamese POS tagger trained on general text by a large margin

The tagger as well as the annotated data can be useful for further research, not only on POS tagging but also on other NLP tasks for Vietnamese social media text