Mining Vietnamese Comparative Sentences for Sentiment Analysis

Ngo Xuan Bach+*, Nguyen Dinh Tai+, Pham Duc Van+, Tu Minh Phuong+*

+ Department of Computer Science, PTIT, Vietnam
* Machine Learning & Applications Lab, PTIT, Vietnam

KSE 2015, Ho Chi Minh City - Vietnam, October 2015

Sentiment analysis & Opinion mining

Analyzing opinionated texts, such as opinions, emotions, sentiments, evaluations, beliefs, and speculations
  o Help customers in choosing products and services
  o Provide useful information for companies and vendors in marketing and market studies

Which hotel should I stay at?

Ngo Xuan Bach

Sentiment classification

Most existing work in sentiment analysis and opinion mining focuses on sentiment classification
Classify sentences/documents (e.g. reviews) based on the overall sentiments expressed by authors
  o positive, negative and (possibly) neutral
Examples
  o It was a wonderful trip.
  o That hotel provides very bad services.

Mining comparative sentences

An important task in sentiment analysis and opinion mining
  o Comparative sentences have specific structures
  o Compare two entities (or sets of entities) in some features or aspects
Several studies have been done for English and some other languages
Consists of two subtasks
  o Identifying comparative sentences
      o Identify comparative sentences in documents and classify them into some types
  o Recognition of relations
      o Recognize entities, features, and comparing words in a comparative sentence

An example

The display quality of mobile phone X is better than that of mobile phone Y

Identifying comparative sentences
  o Sentence type: non-equative comparative sentence
Recognition of relations
  o Two entities: mobile phone X and mobile phone Y
  o Features: display quality
  o Comparing words: better than
“mobile phone X” is the preferred entity
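The result of both subtasks for the example sentence can be held in one small record. The sketch below is only an illustration: the class name `ComparativeRelation` and all of its field names are assumptions introduced here, not names from the original work.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ComparativeRelation:
    """Hypothetical container for the mined content of one comparative sentence."""
    sentence: str
    sentence_type: str                 # "equative", "non-equative", or "superlative"
    entities: List[str]                # the entities being compared
    features: List[str]                # the aspects on which they are compared
    comparing_words: List[str]         # the words that express the comparison
    preferred_entity: Optional[str] = None  # not every sentence has one


# The example sentence, filled in by hand from the analysis above.
example = ComparativeRelation(
    sentence="The display quality of mobile phone X is better than that of mobile phone Y",
    sentence_type="non-equative",
    entities=["mobile phone X", "mobile phone Y"],
    features=["display quality"],
    comparing_words=["better than"],
    preferred_entity="mobile phone X",
)
```

Subtask 1 would fill `sentence_type`, and subtask 2 would fill `entities`, `features`, and `comparing_words`.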

This work

Presents a framework for mining Vietnamese comparative sentences
  o Subtask 1: a classification problem
  o Subtask 2: a sequence learning problem
Introduces a corpus for the task
  o The domain of electronic devices
Describes a series of experiments on the task
  o Different learning methods and feature sets
The first work conducted for Vietnamese

Outline

Motivation
Our method
Experiments
Summary

A framework for mining Vietnamese comparative sentences

The focus of this work


Identifying comparative sentences

We consider 3 types of comparative sentences
Equative
  o The camera of mobile phone X is similar to the one of mobile phone Y
Non-equative
  o The camera of mobile phone X is better than the one of mobile phone Y
Superlative
  o iPhone 5S is the most expensive one in the iPhone series

Identifying comparative sentences

We model the subtask as a classification problem
  o Input: a sentence
  o Output: 1 (equative), 2 (non-equative), 3 (superlative), 0 (noncomparative)
Learning methods
  o Maximum Entropy Models (Berger et al., 1996)
  o Support Vector Machines (Vapnik, 1998)
Features
  o Words, syllables, n-grams
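The n-gram features listed above can be sketched as a simple counting function. This is a minimal illustration, not the authors' feature extractor: the function name `ngram_features`, the Vietnamese sentence, and its hand-made word segmentation are all assumptions made for this example.

```python
from collections import Counter
from typing import List


def ngram_features(tokens: List[str], max_n: int = 2) -> Counter:
    """Count every n-gram of length 1..max_n in a token sequence."""
    feats: Counter = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            feats[" ".join(tokens[i:i + n])] += 1
    return feats


# Syllable-based: Vietnamese syllables are separated by spaces,
# so whitespace splitting is enough.
syllables = "điện thoại X tốt hơn điện thoại Y".split()
syllable_feats = ngram_features(syllables, max_n=2)

# Word-based: assumes a word segmenter has already joined the syllables
# of each word; this segmentation is written by hand for illustration.
words = ["điện_thoại", "X", "tốt", "hơn", "điện_thoại", "Y"]
word_feats = ngram_features(words, max_n=2)
```

With `max_n=2` the function returns the "1-grams + 2-grams" feature set. The resulting counts would then be turned into a sparse vector for the classifier.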

Relation recognition

Input: a comparative sentence
Output: entities, features, and comparing words

Relation recognition

We model the subtask as a sequence learning problem
  o A sentence is a sequence of words (or syllables)
Learning method
  o CRF (Lafferty et al., 2001)
Use IOB notation

[Figure: Examples of sequence labels in a syllable-based model]
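IOB notation marks the first token of each annotated span with `B-`, the following tokens of the span with `I-`, and every other token with `O`. The sketch below applies it to the English example sentence with word tokens rather than Vietnamese syllables. The tag names `ENT`, `FEA`, and `CMP` and the helper `to_iob` are assumptions chosen for illustration.

```python
from typing import List, Tuple


def to_iob(tokens: List[str], spans: List[Tuple[int, int, str]]) -> List[str]:
    """Convert (start, end, tag) spans, with end exclusive, into IOB labels."""
    labels = ["O"] * len(tokens)
    for start, end, tag in spans:
        labels[start] = f"B-{tag}"
        for i in range(start + 1, end):
            labels[i] = f"I-{tag}"
    return labels


tokens = ("The display quality of mobile phone X is better than "
          "that of mobile phone Y").split()

# Hand-annotated spans (hypothetical tag names).
spans = [
    (1, 3, "FEA"),    # display quality
    (4, 7, "ENT"),    # mobile phone X
    (8, 10, "CMP"),   # better than
    (12, 15, "ENT"),  # mobile phone Y
]

labels = to_iob(tokens, spans)
for token, label in zip(tokens, labels):
    print(f"{token}\t{label}")
```

A sequence model such as a CRF is then trained to predict one such label per token.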

Experiments


Datasets

Collected from online newspapers in the domain of electronic devices
  o VnReview¹ and TinhTe²
Contains 4000 Vietnamese sentences (1000 sentences for each type of comparative sentence and 1000 noncomparative sentences)
  o 5119 entities
  o 2942 features
  o 1087 comparing words (only in non-equative type)

¹ http://vnreview.vn
² https://www.tinhte.vn

Experimental Setups

Subtask 1
  o Using all 4000 sentences
  o 5-fold cross validation test
  o Tools: LibSVM¹ with RBF Kernel
  o Measures: Accuracy, Precision, Recall, and the F1 score
Subtask 2
  o Using 3000 comparative sentences
  o 5-fold cross validation test
  o Tools: CRF++² by Kudo
  o Measures: Precision, Recall, and the F1 score

¹ https://www.csie.ntu.edu.tw/~cjlin/libsvm/
² http://taku910.github.io/crfpp/
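A 5-fold cross validation test splits the data into five parts and uses each part once as the test set while training on the other four. The slides do not say how the folds were drawn, so the sketch below makes its own assumptions: a seeded random shuffle, no stratification by sentence type, and a hypothetical helper name `k_fold_indices`.

```python
import random
from typing import List, Tuple


def k_fold_indices(n_items: int, k: int = 5,
                   seed: int = 0) -> List[Tuple[List[int], List[int]]]:
    """Return k (train_indices, test_indices) pairs covering all items once as test."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)       # assumption: shuffled, unstratified
    folds = [indices[i::k] for i in range(k)]  # k nearly equal parts
    splits = []
    for i in range(k):
        test = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        splits.append((train, test))
    return splits


# Subtask 1 uses all 4000 sentences: each fold trains on 3200 and tests on 800.
splits = k_fold_indices(4000, k=5)
for train, test in splits:
    print(len(train), len(test))
```

The reported measures would then be computed from the results on the five test folds.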

Comparative sentence identification

Experimental results using SVM

Feature Extraction Method   Feature Set                    Accuracy (%)
Syllable-based              1-grams                        83.27
                            1-grams + 2-grams              86.30
                            1-grams + 2-grams + 3-grams    84.31
Word-based                  1-grams                        82.59
                            1-grams + 2-grams              86.11
                            1-grams + 2-grams + 3-grams    83.22

The syllable-based method outperformed the word-based method for all three feature sets
Combining 1-grams and 2-grams gave the best results for both methods

Comparative sentence identification

Experimental results using SVM for each sentence type

Sentence type   Precision (%)   Recall (%)   F1 (%)
Equative        86.93           92.00        89.38
Non-equative    82.18           80.51        81.32
Superlative     93.70           89.97        91.79

Superlative sentences had the highest F1 score
  o Usually contain some specific phrases, such as “the best”, “the worst”, and “all others”
  o The structure of superlative sentences is different from other types
      o Equative and non-equative sentences compare two entities, superlative sentences compare an entity with all the others
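Per-type precision, recall, and F1 such as those in the table above follow the standard one-vs-rest definitions. The sketch below computes them from gold and predicted labels, using the label coding of subtask 1 (1 equative, 2 non-equative, 3 superlative, 0 noncomparative). The helper name `per_class_prf` and the toy label lists are assumptions made for illustration and are not data from the experiments.

```python
from typing import Dict, List


def per_class_prf(gold: List[int], pred: List[int], label: int) -> Dict[str, float]:
    """Precision, recall, and F1 for one label, treating all other labels as negative."""
    tp = sum(1 for g, p in zip(gold, pred) if g == label and p == label)
    fp = sum(1 for g, p in zip(gold, pred) if g != label and p == label)
    fn = sum(1 for g, p in zip(gold, pred) if g == label and p != label)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels, invented for this example.
gold = [1, 1, 2, 2, 3, 3, 0, 0]
pred = [1, 2, 2, 2, 3, 3, 0, 1]

for label, name in [(1, "equative"), (2, "non-equative"), (3, "superlative")]:
    print(name, per_class_prf(gold, pred, label))
```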

Comparative sentence identification

[Figure: SVM vs. MEM]

Relation recognition

Experimental results using CRF with different feature sets

Model              Precision (%)   Recall (%)   F1 (%)
Window size = 1    90.00           81.33        85.89
Window size = 2    91.21           81.66        86.17
Window size = 3    91.36           81.73        86.28
Without POS tags   91.71           77.52        84.02

Window size 3 gave the best F1 score
In general, the window size had little effect on the experimental results
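The slides do not list the exact CRF feature templates. One common reading of "window size" is that the features for a token include the tokens and POS tags at offsets from -w to +w around it. The sketch below follows that reading. The function `window_features`, the `<PAD>` marker, the English tokens, and the Penn Treebank-style POS tags are all assumptions made for illustration.

```python
from typing import Dict, List


def window_features(tokens: List[str], pos_tags: List[str], i: int,
                    window: int = 1, use_pos: bool = True) -> Dict[str, str]:
    """Token (and optionally POS) features at offsets -window..+window around position i."""
    feats: Dict[str, str] = {}
    for offset in range(-window, window + 1):
        j = i + offset
        in_range = 0 <= j < len(tokens)
        feats[f"tok[{offset}]"] = tokens[j] if in_range else "<PAD>"
        if use_pos:
            feats[f"pos[{offset}]"] = pos_tags[j] if in_range else "<PAD>"
    return feats


tokens = ["X", "is", "better", "than", "Y"]
pos_tags = ["NNP", "VBZ", "JJR", "IN", "NNP"]  # illustrative tags only

# Features for the token "better" with window size 1.
feats = window_features(tokens, pos_tags, 2, window=1)
print(feats)
```

Setting `use_pos=False` corresponds to the "Without POS tags" row under this reading.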

Relation recognition

Experimental results (F1 score, %) of relation recognition in detail

Model              Entity   Feature   Comparing Word
Window size = 1    93.62    76.88     73.06
Window size = 2    93.44    78.04     73.74
Window size = 3    93.33    78.52     73.48
Without POS tags   91.64    75.75     70.79

The first 3 models achieved nearly the same results
POS tags played an important role in relation recognition

Relation recognition

Experimental results on each type of sentence

                Entity                      Feature
Sentence type   Pre(%)   Rec(%)   F1(%)     Pre(%)   Rec(%)   F1(%)
Equative        95.78    82.35    88.56     83.33    63.39    72.00
Non-equative    95.10    91.35    93.19     83.80    65.50    73.53
Superlative     95.50    92.79    94.12     88.49    73.00    80.00

As in the first subtask, the best results were obtained on superlative sentences, for both entities and features

Summary


Summary

Presented an empirical study on mining Vietnamese comparative sentences
  o Subtask 1: Identifying comparative sentences
  o Subtask 2: Recognition of relations
Introduced a new corpus for this task
  o 4000 Vietnamese sentences
Our model got promising results
  o Subtask 1: 86.30% accuracy
  o Subtask 2: 93.33%, 78.52%, and 73.48% in the F1 score on recognition of entities, features, and comparing words, respectively

Summary

Future work
  o Study joint models
      o Identify comparative sentences and recognize relations simultaneously
  o Complete the task
      o Identify the overall opinion of comparative sentences

Thank you for your attention!
