Vietnamese POS Tagging for Social Media Text

Ngo Xuan Bach+*, Nguyen Dieu Linh+, Tu Minh Phuong+*
+ Posts and Telecommunications Institute of Technology, Vietnam
* FPT Software Research Lab, Vietnam

ICONIP2016, Kyoto Japan, October 2016

Part-of-Speech (POS) Tagging

The process of assigning to each word in a sentence the proper POS tag for the context in which it appears
- Input: Book that flight .
- Output: Book/VB that/DT flight/NN ./.

A fundamental task in natural language processing (NLP)
- Provides useful information for many other NLP tasks: word sense disambiguation, text chunking, named entity recognition, syntactic parsing, semantic role labeling, semantic parsing, and so on

Challenges
- How to find POS tags of new words and how to disambiguate multi-sense words
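The word/TAG notation in the input/output example can be read and written with a few lines of Python. This is a minimal sketch; the function names `parse_tagged` and `format_tagged` are hypothetical and not part of any tagger mentioned in these slides.

```python
def parse_tagged(line):
    """Split a 'word/TAG' sentence into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST slash, so "./." becomes (".", ".")
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs


def format_tagged(pairs):
    """Inverse of parse_tagged: join (word, tag) pairs back into one line."""
    return " ".join(f"{word}/{tag}" for word, tag in pairs)


pairs = parse_tagged("Book/VB that/DT flight/NN ./.")
print(pairs)
# [('Book', 'VB'), ('that', 'DT'), ('flight', 'NN'), ('.', '.')]
```

Splitting on the last slash keeps tokens that themselves contain a slash (such as the final punctuation mark) intact.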

POS Tagging

Has been studied intensively for several decades
- English (Brill, 1995; Ratnaparkhi, 1996; Toutanova et al., 2003)
- Japanese (Nakagawa et al., 2002; Nakagawa and Uchimoto, 2007)
- Arabic (Aldarmaki and Diab, 2015)
- Vietnamese (Nghiem et al., 2008; Tran et al., 2009; Bach et al., 2013)

State-of-the-art POS taggers are statistical or machine learning based models trained on annotated corpora of conventional text
- Penn Treebank for English
- Kyoto corpus for Japanese
- Viet Treebank for Vietnamese

Social Media Text

Web 2.0 platforms such as blogs, forums, wikis, and social networks have facilitated the generation of a huge volume of user-generated text
- Has become an important source for both the data mining and NLP communities
- Requires appropriate tools for text analysis

POS Tagging for Social Media Text

Social media text poses several challenges

Facebook Sentences | Expected Sentences | Translated Sentences | Problems
Em đọc đc ấy mà a | Em đọc được ấy mà anh | I can read it | abbreviation
Nó good vậyyyy | Nó giỏi vậy | He is so good | foreign word, typo
Ng đàn ông mặc áo đen cơ. :)) | Người đàn ông mặc áo đen cơ. :)) | Must be the guy with a black shirt | abbreviation, emoticon
Toi thich cái màu trắng | Tôi thích cái màu trắng | I like the white one | word without tone mark

A POS tagger developed for conventional, edited text would perform poorly on such noisy data

This Work

We develop a Vietnamese tagging model with various types of linguistic features and a new POS tagset
- Empirically show the effectiveness of the method on data from Vietnamese Facebook

We construct an annotated corpus for Vietnamese POS tagging
- Consisting of 4150 sentences collected from Facebook

Both the annotated corpus and the trained POS tagger are made available to the research community

Outline

- Introduction
- Tagging Method
- Experiments
- Summary

A POS Tagset for Social Media Text

We extended the conventional POS tagset to cover the variations of social media text

Annotation Procedure

We extracted textual content of posts and comments from Vietnamese Facebook

Raw text -> Preprocessing -> Automatic tagging -> Manual tagging -> Annotated corpus
- Preprocessing: split sentences, remove noisy sentences
- Automatic tagging: tag sentences using vnTagger
- Manual tagging: two annotators manually correct sentences

We used Cohen's kappa coefficient to measure the inter-annotator agreement
- The score was 0.84, which can be interpreted as almost perfect agreement
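Cohen's kappa compares the observed agreement p_o between two annotators with the agreement p_e expected by chance: kappa = (p_o - p_e) / (1 - p_e). On the commonly used Landis and Koch scale, values from 0.81 to 1.00 are read as almost perfect agreement. The sketch below computes it for two tag sequences; the function name `cohen_kappa` and the toy tag lists are illustrative assumptions, not data from the corpus.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same tokens."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty token list")
    n = len(labels_a)
    # Observed agreement: fraction of tokens given the same tag by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal tag frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[tag] * count_b[tag] for tag in count_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical tag throughout
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy example (made-up tags): the annotators disagree on 2 of 10 tokens.
annotator_1 = ["N", "V", "N", "A", "V", "N", "R", "N", "V", "A"]
annotator_2 = ["N", "V", "N", "A", "N", "N", "R", "N", "V", "V"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
```

Here the observed agreement is 0.8 and the chance agreement is 0.32, so kappa is noticeably lower than the raw agreement rate.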

Statistical Information of the Corpus

#sentences | #unique words | #tokens | #tokens/sentence
4150 | 6416 | 38498 | 9.3

POS tag distribution (%)

V 19.2 | N 18.46 | PUN 8.86 | R 8.04 | A 6.68 | P 5.89
AB 5.23 | E 3.8 | Np 3.54 | T 3.52 | C 3.17 | M 2.41
HP 2.36 | X 1.75 | I 1.55 | FL 1.31 | Nc 1.15 | L 0.93
CF 0.79 | CC 0.49 | Nu 0.32 | SD 0.3 | IL 0.21 | AR 0.05
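Statistics of this kind can be derived directly from a tagged corpus. The sketch below assumes the corpus is held in memory as a list of sentences, each a list of (word, tag) pairs; the function name `corpus_statistics` and the two toy sentences (with illustrative tags, not taken from the corpus) are assumptions. As a sanity check on the reported figures, 38498 / 4150 is about 9.28, which agrees with the 9.3 tokens per sentence above.

```python
from collections import Counter


def corpus_statistics(sentences):
    """sentences: list of sentences, each a list of (word, tag) pairs."""
    words = [word for sent in sentences for word, _ in sent]
    tag_counts = Counter(tag for sent in sentences for _, tag in sent)
    n_sent, n_tok = len(sentences), len(words)
    return {
        "sentences": n_sent,
        "tokens": n_tok,
        "unique_words": len(set(words)),
        "tokens_per_sentence": n_tok / n_sent if n_sent else 0.0,
        # Share of each tag among all tokens, most frequent first.
        "tag_percent": {tag: 100.0 * c / n_tok for tag, c in tag_counts.most_common()},
    }


# Toy corpus; the tags are illustrative only.
toy = [
    [("Tôi", "P"), ("thích", "V"), ("cái", "Nc"), ("màu", "N"), ("trắng", "A")],
    [("Nó", "P"), ("giỏi", "A"), ("vậy", "T")],
]
stats = corpus_statistics(toy)
print(stats["sentences"], stats["tokens"], stats["tokens_per_sentence"])
# 2 8 4.0
```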

Tagging Model

Corpus -> Feature Extraction -> Conditional Random Fields -> Tagging Model

Feature Type | Description
Basic features | unigrams, bigrams, and trigrams of words
Enhanced features | special character, icon or emoticon, digits, capitalization, hashtags and URLs
METAPH feature | a coarse phonetic normalization of words to simpler keys, created with the Metaphone algorithm
GENTAG features | the output (the predicted POS tags) of vnTagger: unigrams, bigrams, and trigrams of POS tags
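The feature types in the table can be sketched as one feature dictionary per token, a shape that CRF toolkits such as sklearn-crfsuite accept. Everything below is an assumption made for illustration: the feature names, the emoticon and URL patterns, and the context window are not taken from the authors' implementation. In particular, `phonetic_key` is a crude stand-in that strips tone marks and collapses repeated letters; it is not the Metaphone algorithm. The optional `base_tags` argument stands for tags predicted by an existing tagger such as vnTagger.

```python
import re
import unicodedata

# Illustrative patterns; the authors' actual icon list and URL rules are not given.
EMOTICON_RE = re.compile(r"^[:;=][\-^']?[()\[\]DPpOo3|/\\]+$")
URL_RE = re.compile(r"^(https?://|www\.)\S+$", re.IGNORECASE)


def phonetic_key(word):
    """Crude normalization key (NOT Metaphone): lowercase, drop diacritics,
    collapse runs of a repeated character."""
    text = unicodedata.normalize("NFD", word.lower()).replace("đ", "d")
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    return re.sub(r"(.)\1+", r"\1", text)


def token_features(words, i, base_tags=None):
    """Features for words[i]; base_tags are optional tags from another tagger."""
    last = len(words) - 1
    w = words[i].lower()
    prev_w = words[i - 1].lower() if i > 0 else "<s>"
    next_w = words[i + 1].lower() if i < last else "</s>"
    feats = {
        # Basic features: word unigram, bigrams, trigram.
        "w0": w, "w-1": prev_w, "w+1": next_w,
        "w-1|w0": f"{prev_w}|{w}", "w0|w+1": f"{w}|{next_w}",
        "w-1|w0|w+1": f"{prev_w}|{w}|{next_w}",
        # Enhanced features.
        "has_special_char": bool(re.search(r"[^\w\s]", words[i])),
        "is_emoticon": bool(EMOTICON_RE.match(words[i])),
        "has_digit": any(ch.isdigit() for ch in words[i]),
        "is_capitalized": words[i][:1].isupper(),
        "is_hashtag": words[i].startswith("#") and len(words[i]) > 1,
        "is_url": bool(URL_RE.match(words[i])),
        # Phonetic-style key (stand-in for the METAPH feature).
        "phon": phonetic_key(words[i]),
    }
    if base_tags is not None:
        # GENTAG-style features: n-grams over the other tagger's output.
        t = base_tags[i]
        prev_t = base_tags[i - 1] if i > 0 else "<s>"
        next_t = base_tags[i + 1] if i < last else "</s>"
        feats.update({"t0": t, "t-1|t0": f"{prev_t}|{t}", "t0|t+1": f"{t}|{next_t}",
                      "t-1|t0|t+1": f"{prev_t}|{t}|{next_t}"})
    return feats


def sentence_features(words, base_tags=None):
    return [token_features(words, i, base_tags) for i in range(len(words))]


for feats in sentence_features(["Nó", "good", "vậyyyy", ":))"]):
    print(feats["w0"], feats["phon"], feats["is_emoticon"])
```

With this key, "Toi" and "Tôi" map to the same value, as do "vậy" and "vậyyyy", which is the kind of normalization a phonetic feature is meant to provide for noisy spellings.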

Experiments


Experimental Setup

- Conducted 10-fold cross-validation
- Measured the performance by accuracy, precision, recall, and the F1 score

Model | Characteristics
Baseline 1 | used the output of vnTagger
Baseline 2 | used a list of icons to automatically correct the output of vnTagger
CRF1 | used CRFs with basic features
CRF2 | CRF1 + enhanced features
CRF3 | CRF2 + METAPH features
CRF4 | CRF3 + features from Baseline 1
CRF5 | CRF3 + features from Baseline 2
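The evaluation protocol can be sketched in plain Python. The slides do not say how sentences were assigned to folds, so the contiguous split below is an assumption, as are the function names `k_fold_indices` and `evaluate`. Precision, recall, and F1 are computed per tag from token-level counts.

```python
from collections import Counter


def k_fold_indices(n_items, k=10):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    sizes = [n_items // k + (1 if fold < n_items % k else 0) for fold in range(k)]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_items))
        yield train, test
        start += size


def evaluate(gold, pred):
    """Return token accuracy and {tag: (precision, recall, f1)}."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted tag p, but it was wrong
            fn[g] += 1  # gold tag g was missed
    per_tag = {}
    for tag in sorted(set(gold) | set(pred)):
        prec = tp[tag] / (tp[tag] + fp[tag]) if tp[tag] + fp[tag] else 0.0
        rec = tp[tag] / (tp[tag] + fn[tag]) if tp[tag] + fn[tag] else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        per_tag[tag] = (prec, rec, f1)
    accuracy = sum(tp.values()) / len(gold) if gold else 0.0
    return accuracy, per_tag


# With 4150 sentences, each of the 10 folds holds out 415 sentences.
folds = list(k_fold_indices(4150, k=10))
print(len(folds), len(folds[0][0]), len(folds[0][1]))
# 10 3735 415

# Toy example with made-up tags: one of five tokens is mis-tagged.
accuracy, per_tag = evaluate(["N", "V", "N", "A", "PUN"], ["N", "V", "A", "A", "PUN"])
print(accuracy, per_tag["N"])
```

In a full run, a model would be trained on the sentences at `train_indices` and scored on those at `test_indices` for each fold, with the scores averaged over the 10 folds.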

Tagging Accuracy (%)

Model | Tagging Accuracy
Baseline 1 | 76.99
Baseline 2 | 80.69
CRF1 | 81.39
CRF2 | 83.53
CRF3 | 83.86
CRF4 | 88.16
CRF5 | 88.26

Tagging Results in Detail


Confused Tags



Summary

We have developed a Vietnamese part-of-speech tagging system for social media text
- Built on an annotated corpus from Facebook
- Outperformed a state-of-the-art Vietnamese POS tagger trained on general text by a large margin

The tagger as well as the annotated data can be useful for further research, not only on POS tagging but also on other NLP tasks for Vietnamese social media text
