Incorporating Sentiment Prior Knowledge for Weakly-Supervised Sentiment Analysis

YULAN HE
Knowledge Media Institute, The Open University, UK

This paper presents two novel approaches for incorporating sentiment prior knowledge into the topic model for weakly-supervised sentiment analysis where sentiment labels are considered as topics. One is by modifying the Dirichlet prior for the topic-word distribution (LDA-DP), the other is by augmenting the model objective function through adding terms that express preferences on expectations of sentiment labels of the lexicon words using generalized expectation criteria (LDA-GE). We conducted extensive experiments on English movie review data and a multi-domain sentiment dataset as well as Chinese product reviews about mobile phones, digital cameras, MP3 players, and monitors. The results show that while both LDA-DP and LDA-GE perform comparably to existing weakly-supervised sentiment classification algorithms, they are much simpler and more computationally efficient, rendering them more suitable for online and real-time sentiment classification on the Web. We observed that LDA-GE is more effective than LDA-DP, suggesting that it should be preferred when considering employing the topic model for sentiment analysis. Moreover, both models are able to extract highly domain-salient polarity words from text.

Categories and Subject Descriptors: I.2.7 [Artificial Intelligence]: Natural Language Processing—Text analysis; H.3.1 [Information Storage and Retrieval]: Content Analysis and Indexing—Linguistic processing; H.3.3 [Information Storage and Retrieval]: Information Search and Retrieval—Information filtering

General Terms: Algorithms, Experimentation

Additional Key Words and Phrases: Sentiment analysis, latent Dirichlet allocation, generalized expectation, weakly-supervised sentiment classification

1. INTRODUCTION

Sentiment analysis aims to understand subjective information such as opinions, attitudes, and feelings expressed in text. It has become a hot topic in recent years because of the explosion in availability of people's attitudes and opinions expressed in social media including blogs, discussion forums, tweets, etc. Sentiment analysis has found its way into a wide range of applications for tracking companies' reputations, finding customers' opinions about products/services and competitors, monitoring positive or negative trends in social media, etc. The rapid evolution of user-generated content demands sentiment analysis tools that can easily adapt to new domains with minimum supervision.


Most prior work [Pang et al. 2002; Kim and Hovy 2004; Pang and Lee 2004; Choi et al. 2005; Blitzer et al. 2007; Zhao et al. 2008; Narayanan et al. 2009] views sentiment classification as a text classification problem where an annotated corpus with documents labeled with their sentiment orientation is required to train classifiers. As such, these approaches lack portability across different domains. Instead of using labeled documents for sentiment classifier training, there has been growing interest in exploring labeled words or labeled features as supervision information for classifier learning. For example, the word "excellent" typically conveys positive sentiment. In recent years, much effort has been devoted to incorporating prior belief of word-sentiment associations from a sentiment lexicon into classifier learning by combining such lexical knowledge with a small set of labeled documents [Andreevskaia and Bergler 2008; Li et al. 2009; Melville et al. 2009].

This paper proposes weakly-supervised approaches based on latent Dirichlet allocation (LDA) [Blei et al. 2003] for sentiment classification by incorporating lexical knowledge obtained from available sentiment lexicons1. We propose two possible ways of incorporating sentiment prior knowledge into LDA model learning: one is by modifying the Dirichlet prior for the topic-word distribution (LDA-DP), the other is by augmenting the model objective function through adding terms which express preferences on expectations of sentiment labels of the lexicon words using generalized expectation criteria [McCallum et al. 2007] (LDA-GE). The proposed approaches perform sentiment analysis without the use of labeled documents. In addition, they are simple and computationally efficient, rendering them more suitable for online and real-time sentiment classification on the Web.

The major contribution of this work is two-fold.

—We proposed two principled approaches for incorporating prior knowledge into LDA model learning, LDA-DP and LDA-GE, and derived efficient training and inference procedures for LDA-DP based on Gibbs sampling, and for LDA-GE based on variational Bayes. While this paper mainly focuses on exploring sentiment prior knowledge, the proposed approaches can be used in more general settings where the prior classes of certain terms are known.

—We compared the performance of LDA-DP and LDA-GE extensively on both English and Chinese review data, including English movie reviews, the multi-domain sentiment dataset, and Chinese product reviews on mobile phones, digital cameras, MP3 players, and monitors. The results show that our methods attain comparable or better performance than other previously proposed weakly-supervised or semi-supervised methods for sentiment classification despite using no labeled documents. In addition, we observed that LDA-GE is more effective than LDA-DP, suggesting that it should be preferred when incorporating prior knowledge into the topic model.

The rest of the paper is structured as follows. Related work on weakly-supervised and semi-supervised sentiment classification is discussed in Section 2. The LDA model is introduced in Section 3. The proposed LDA-DP and LDA-GE are presented in Sections 4 and 5 respectively, followed by a toy example illustrated in Section 6. The experimental setup and results are discussed in Sections 7 and 8. Finally, Section 9 concludes the paper and outlines directions for future research.

1 The paper is a substantial extension of [He 2011].


2. RELATED WORK

Turney [2002] first proposed a sentiment classification approach that does not require labeled data. He calculated the semantic orientations of phrases in documents that contain adjectives or adverbs as the pointwise mutual information (PMI) with a positive prototype "excellent" minus the PMI with a negative prototype "poor". His approach achieved an accuracy of 84% for automobile reviews and 66% for movie reviews. While Turney only used one polarity prototype for each class ("excellent" and "poor"), Read and Carroll [2009] chose seven polarity prototypes which were obtained from Roget's Thesaurus and WordNet and selected based on their respective frequency in the Gigaword corpus. They then measured the similarity between words and polarity prototypes in three different ways: lexical association (using PMI), semantic spaces, and distributional similarity. Still, the best result was achieved using PMI, with 69.1% accuracy obtained on the movie review data.

Weakly-supervised sentiment classification approaches do not require labeled documents; instead, they use supervision information either from sentiment lexicons containing a list of words marked as positive or negative, or from user feedback. Lin and He [2009] proposed a joint sentiment-topic (JST) model to model both sentiment and topics from text, and they incorporate sentiment prior information by modifying conditional probabilities used in Gibbs sampling during JST model learning. Dasgupta and Ng [2009] utilized user feedback in the spectral clustering process to ensure that texts are clustered along the sentiment dimension. Features induced for each dimension of spectral clustering can be considered as sentiment-oriented topics.

Other weakly-supervised sentiment classification approaches typically adopt the self-training strategy. Zagibalov and Carroll [2008b; 2008a] start with a one-word sentiment seed vocabulary and use iterative training to gradually enlarge the seed vocabulary by adding more sentiment-bearing lexical items based on their relative frequency in both the positive and negative parts of the current training data. The sentiment direction of a document is then determined by the sum of the sentiment scores of all the sentiment-bearing lexical items found in the document. Instead of using a one-word seed dictionary as in [Zagibalov and Carroll 2008b], Qiu et al. [2009] proposed to start with the much larger HowNet Chinese sentiment dictionary2 as the initial lexicon. Documents classified by the first phase are taken as the training set to train SVMs which are subsequently used to revise the results produced by the first phase. Tan et al. [2008] proposed a combination of lexicon-based and corpus-based approaches that first labels some examples from a given domain using a sentiment lexicon and then trains a supervised classifier based on the labeled ones from the first stage.

There has also been work combining labeled documents with lexical prior knowledge obtained from sentiment lexicons. For example, Andreevskaia and Bergler [2008] integrate a corpus-based classifier trained on a small set of annotated in-domain data and a lexicon-based system trained on WordNet for sentence-level sentiment annotation across different domains.

2 http://www.keenage.com/download/sentiment.rar


Li et al. [2009] employ lexical prior knowledge for semi-supervised sentiment classification based on non-negative matrix tri-factorization, where the domain-independent prior knowledge was incorporated in conjunction with domain-dependent unlabeled data and a few labeled documents. Melville et al. [2009] also combine lexical information from a sentiment lexicon with labeled documents, where word-class probabilities in Naïve Bayes classifier learning are calculated as a weighted combination of word-class distributions estimated from the sentiment lexicon and the labeled documents respectively.

Outside sentiment classification, much recent work has been conducted to explore labeled features in model learning without labeled instances. For example, some approaches use human-annotated labeled features to generate pseudo-labeled examples that are subsequently used in standard supervised learning [Schapire et al. 2002; Wu and Srihari 2004]. Druck et al. [2008] proposed training discriminative probabilistic models with labeled features and unlabeled instances using generalized expectation (GE) criteria. Labeled features can come from human annotations or through unsupervised feature clustering with latent Dirichlet allocation (LDA). For LDA-generated features, the feature labels are generated by an oracle which assumes the availability of labeled instances. These soft constraints are then expressed as GE criteria.

Incorporating supervised information into LDA model learning has been studied in [Blei and McAuliffe 2008; Mimno and McCallum 2008; Lacoste-Julien et al. 2008; Ramage et al. 2009]. Blei and McAuliffe [2008] proposed supervised LDA (sLDA) which uses the empirical topic frequencies as a covariate for a regression on document labels such as movie ratings. Mimno and McCallum [2008] proposed a Dirichlet-multinomial regression which uses a log-linear prior on document-topic distributions that is a function of observed features of the document, such as author, publication venue, references, and dates. DiscLDA [Lacoste-Julien et al. 2008] and Labeled LDA [Ramage et al. 2009] assume the availability of document class labels and utilize a transformation matrix to modify Dirichlet priors. DiscLDA introduces a class-dependent linear transformation to project a K-dimensional (K latent topics) document-topic distribution into an L-dimensional space (L document labels), while Labeled LDA simply defines a one-to-one correspondence between LDA's latent topics and document labels.

In contrast to the aforementioned methods, our proposed approaches incorporate sentiment prior knowledge by either modifying the Dirichlet prior for topic-word distributions or augmenting the LDA objective function through adding generalized expectation criteria terms, and essentially create an informed prior distribution for the sentiment labels.

3. LATENT DIRICHLET ALLOCATION (LDA)

Latent Dirichlet allocation (LDA) [Blei et al. 2003] is a generative probabilistic model which is widely used in document analysis. It models the semantic relationships between words based on their co-occurrences in documents. It is based on a Bayesian model where each document is modeled as a mixture of latent topics, and a topic is a discrete probability distribution that defines how likely each word


is to appear in a given topic. For example, an LDA model might have topics that can be classified as Academic and Game. We would expect the Academic topic to have higher probabilities of generating words such as university, research, professor, etc., and the Game topic to be more likely to generate words such as play, score, points, etc. Words without special relevance, such as computer, will have roughly even probability between classes.

LDA can be illustrated with the graphical model shown in Figure 1(a). We use plate notation here, where the boxes are "plates" representing replicates. The outer plate represents documents, where M denotes the number of documents. The inner plate represents the repeated choice of topics and words within a document, where Nd denotes the number of words in document d. Shaded nodes denote observed variables and unshaded nodes denote hidden variables.

"

s

w

#

! Nd

"

S

M

(a) LDA

Fig. 1.

s

w

!

#

Nd M

$

S

(b) LDA-DP

The original LDA model and LDA-DP with Dirichlet prior transformed.

Assuming that we have a total number of S topics, a corpus with a collection of M documents denoted by D = {d1, d2, ..., dM}, where each document in the corpus is a sequence of Nd words denoted by d = (w1, w2, ..., wNd) (the bold-font variables denote vectors) and each word in the document is an item from a vocabulary index with V distinct terms denoted by {1, 2, ..., V}, the generative process is:

—For each topic s ∈ {1, ..., S}
  —Choose a distribution ϕs ∼ Dir(β).
—For each document d ∈ [1, M]
  —Choose document length Nd ∼ Poisson(ξ).
  —Choose a distribution θd ∼ Dir(α).
  —For each of the Nd word positions wt,
    —Choose a topic st ∼ Multinomial(θd),
    —Choose a word wt ∼ Multinomial(ϕst).

Here, θd is the topic distribution for document d, ϕs is the word distribution for topic s, α is the parameter of the uniform Dirichlet prior on θd, and β is the parameter of the uniform Dirichlet prior on ϕs.
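To make the generative story above concrete, the following is a minimal Python sketch of this process. It is illustrative only: the corpus size, vocabulary size, and hyperparameter values are assumptions, not settings used in the paper.

import numpy as np

def generate_corpus(M=100, V=1000, S=3, alpha=0.1, beta=0.01, xi=50, seed=0):
    """Sample a synthetic corpus from the LDA generative process (illustrative values)."""
    rng = np.random.default_rng(seed)
    # For each topic s, choose a word distribution phi_s ~ Dir(beta)
    phi = rng.dirichlet([beta] * V, size=S)
    corpus = []
    for _ in range(M):
        Nd = rng.poisson(xi)                    # document length ~ Poisson(xi)
        theta_d = rng.dirichlet([alpha] * S)    # topic distribution ~ Dir(alpha)
        doc = []
        for _ in range(Nd):
            s_t = rng.choice(S, p=theta_d)      # topic ~ Multinomial(theta_d)
            w_t = rng.choice(V, p=phi[s_t])     # word ~ Multinomial(phi_{s_t})
            doc.append(w_t)
        corpus.append(doc)
    return corpus, phi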


4. LDA-DP: LDA WITH DIRICHLET PRIOR MODIFIED

Existing sentiment classification approaches fall into three main categories: lexicon-based, corpus-based, or a combination of both. Lexicon-based approaches make use of a sentiment lexicon to classify a document as positive or negative by aggregating the sentiment scores of the words it contains, while corpus-based approaches typically learn a classification model from a labeled corpus. Contrary to existing approaches, we view sentiment classification as a generative problem: when an author writes a review document, he/she first decides on the overall sentiment or polarity (positive, negative, or neutral) of the document, and then, for each sentiment, decides on the words to be used. The LDA model, as shown in Figure 1(a), can be used to model a mixture of only three sentiment labels, i.e. positive, negative and neutral. We can incorporate sentiment prior knowledge into the LDA model by modifying the Dirichlet priors of word-topic distributions as shown in Figure 1(b), which we term LDA-DP. Here we have a total number of S sentiment labels S = {neutral, positive, negative}. The generative process is as follows:

—For each sentiment label s ∈ {1, ..., S}
  —Draw ϕs ∼ Dir(λs × βs^T).
—For each document d ∈ [1, M]
  —Choose document length Nd ∼ Poisson(ξ).
  —Choose a distribution θd ∼ Dir(α).
  —For each of the Nd word positions wt,
    —Choose a sentiment label st ∼ Multinomial(θd),
    —Choose a word wt ∼ Multinomial(ϕst).

Compared to the original LDA model, we add an additional dependency link of ϕ on the matrix λ of size S × V, which we use to encode word prior sentiment information. For each sentiment label s and each word w, λsw encodes the word's prior polarity probability. Initially, all the elements of λ take a value of 1. Given a sentiment lexicon, for each word w ∈ {1, ..., V}, if w is found in the sentiment lexicon, then for each s ∈ {1, ..., S} the element λsw is updated as follows:

\lambda_{sw} = \begin{cases} 0.9 & \text{if } S(w) = s \\ 0.05 & \text{otherwise} \end{cases} \quad (1)

where the function S(w) returns the prior sentiment label of w in the sentiment lexicon and takes a value of 0 if neutral, 1 if positive, and 2 if negative. The matrix λ can be considered as a transformation matrix which modifies the Dirichlet priors β so that the word prior sentiment polarity can be captured. For example, the word "excellent" has a positive sentiment polarity. Its corresponding entries in λ are [0.05, 0.9, 0.05], representing the neutral, positive, and negative prior polarities. Thus, the word has a much higher probability of being drawn from the positive sentiment-word distribution.
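As an illustration, the prior matrix λ defined in Equation 1 can be built directly from a sentiment lexicon. The following is a minimal sketch; the lexicon and vocabulary representations (plain dictionaries mapping words to label indices and to column indices) are assumptions of this example.

import numpy as np

NEUTRAL, POSITIVE, NEGATIVE = 0, 1, 2     # label indices returned by S(w) in Equation 1

def build_lambda(vocab, lexicon, S=3):
    """vocab: {word: column index}; lexicon: {word: prior label index}. Returns an S x V matrix."""
    lam = np.ones((S, len(vocab)))        # words absent from the lexicon keep a value of 1
    for word, w in vocab.items():
        if word in lexicon:
            lam[:, w] = 0.05              # small mass on the non-associated labels
            lam[lexicon[word], w] = 0.9   # most mass on the word's prior sentiment label
    return lam

# Example: build_lambda({"excellent": 0, "poor": 1, "movie": 2},
#                       {"excellent": POSITIVE, "poor": NEGATIVE})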

The total probability of the model is

P(\mathbf{w}, \mathbf{s}, \theta, \varphi; \alpha, \beta) = \prod_{j=1}^{S} P(\varphi_j; \lambda \times \beta) \prod_{d=1}^{M} P(\theta_d; \alpha) \prod_{t=1}^{N_d} P(s_{d,t}|\theta_d)\, P(w_{d,t}|\varphi_{s_{d,t}}) \quad (2)

We use collapsed Gibbs sampling [Griffiths and Steyvers 2004] to approximate the posterior. Gibbs sampling is a Markov chain Monte Carlo method which allows us to repeatedly sample from a Markov chain whose stationary distribution is the posterior of interest; each variable of interest, here sd,t, is sampled from its distribution given the


current values of all other variables and the data. Such samples can be used to empirically estimate the target distribution. Letting the index x = (d, t) denote the t-th word in document d and the subscript −x denote a quantity that excludes data from the t-th word position, the conditional posterior for sx is:

P(s_x = j \mid \mathbf{s}_{-x}, \mathbf{w}, \alpha, \beta) \propto \frac{\{N_{d,j}\}_{-x} + \alpha_j}{\{N_d\}_{-x} + \sum_{j=1}^{S}\alpha_j} \times \frac{\{N_{j,w_t}\}_{-x} + \lambda_{j,w_t}\beta_{j,w_t}}{\{N_j\}_{-x} + \sum_{r=1}^{V}\lambda_{j,r}\beta_{j,r}} \quad (3)

where Nj,wt is the number of times word wt has been associated with sentiment label j; Nj is the number of times words in the corpus have been assigned to sentiment label j; Nd,j is the number of times sentiment label j has been assigned to word tokens in document d; and Nd is the total number of words in document d.

5. LDA-GE: LDA WITH GENERALIZED EXPECTATION CRITERIA

Sentiment prior knowledge can also be incorporated into the LDA model by augmenting the model objective function with terms which express preferences on expectations of sentiment labels of the lexicon words using generalized expectation criteria [McCallum et al. 2007]; we call this model LDA-GE. Letting Λ = {α, β}, we obtain the marginal distribution of a document w by integrating over θ and ϕ and summing over s:

P(\mathbf{w}|\Lambda) = \int\!\!\int P(\theta; \alpha) \prod_{s=1}^{S} P(\varphi_s; \beta) \prod_{t=1}^{N_d} \sum_{s_t} P(s_t|\theta)\, P(w_t|s_t, \varphi_{s_t}) \, d\theta \, d\varphi

Taking the product of the marginal probabilities of the documents in a corpus gives us the probability of the corpus:

P(D|\Lambda) = \prod_{d=1}^{M} P(\mathbf{w}_d|\Lambda)

Assuming we have some labeled features where words are given with their prior sentiment orientation, we can construct a set of real-valued features of the observation to express some characteristic of the empirical distribution of the training data that should also hold for the model distribution:

f_{jk}(\mathbf{w}, \mathbf{s}) = \sum_{d=1}^{M}\sum_{t=1}^{N_d} \delta(s_{d,t} = j)\,\delta(w_{d,t} = k) \quad (4)

where δ(x) is an indicator function which takes a value of 1 if x is true and 0 otherwise. Equation 4 calculates how often feature k and sentiment label j co-occur in the corpus. We define the expectation of the features as E_\Lambda[f(\mathbf{w}, \mathbf{s})] = E_{\tilde{P}(\mathbf{w})}[E_{P(\mathbf{w}|\mathbf{s};\Lambda)}[f(\mathbf{w}, \mathbf{s})]], where \tilde{P}(\mathbf{w}) is the empirical distribution of w in the document corpus D, and P(\mathbf{w}|\mathbf{s}; \Lambda) is a conditional model distribution parameterized at Λ. E_\Lambda[f(\mathbf{w}, \mathbf{s})] is a matrix of size S × K, where S is the total number of sentiment labels and K is the total number of features or constraints used in model learning.


The jk-th entry denotes the expected number of times that feature k is assigned label j. By adding a normalization term z_j = \sum_{d=1}^{M}\sum_{t=1}^{N_d} \delta(s_{d,t} = j) into f_{jk}, the feature expectation becomes the predicted feature distribution for label j, i.e.

\tilde{P}(k|j; \Lambda) = \frac{\sum_{d=1}^{M}\sum_{t=1}^{N_d} \delta(s_{d,t} = j)\,\delta(w_{d,t} = k)}{z_j} \quad (5)

We define a criterion that minimizes the KL divergence between the expected feature distribution and a target expectation \hat{f}, which is essentially an instance of generalized expectation criteria that penalizes the divergence of a specific model expectation from a target value:

G(E_\Lambda[f(\mathbf{w}, \mathbf{s})]) = \mathrm{KL}(\hat{f}\,\|\,E_\Lambda[f(\mathbf{w}, \mathbf{s})]) \quad (6)

We can use the target expectation \hat{f} to encode human or task prior knowledge. For example, the word "excellent" typically represents a positive orientation. We would expect this word to be more likely to appear in positive documents. In our implementation, we adopted a simple heuristic approach [Schapire et al. 2002; Druck et al. 2008] in which a majority of the probability mass for a feature is distributed uniformly among its associated labels, and the remaining probability mass is distributed uniformly among the other non-associated label(s). As we only have three sentiment labels here, the target expectation of a feature for its prior polarity (or associated sentiment label) is 0.9, and 0.05 for each of its non-associated sentiment labels. The above encodes word sentiment prior knowledge in the form of \hat{P}(s|w). However, the actual target expectation used in our approach is \hat{P}(w|s). We can perform the following simple transformation:

\hat{P}(w|s) = \frac{\hat{P}(s|w)P(w)}{P(s)} \propto \hat{P}(s|w)\tilde{P}(w) \quad (7)

by assuming that the prior probability of w can be obtained from the empirical distribution of w in the document corpus D, and that the prior probabilities of the three sentiment labels are uniformly distributed in the corpus.

We augment the likelihood maximization by adding the generalized expectation criteria objective function terms:

O(D|\Lambda) = \log P(D|\Lambda) - \lambda\, G(E_\Lambda[f(\mathbf{w}, \mathbf{s})]) \quad (8)

where λ is a penalty parameter which controls the relative influence of the prior knowledge. This parameter is empirically set to 100 for all the datasets. For brevity, we omit λ in the subsequent derivations. Learning the LDA-GE model amounts to maximizing the objective function in Equation 8. Exact inference on LDA-GE is intractable. We use variational methods to approximate the posterior distribution over the latent variables. The variational distribution, which is assumed to be fully factorized, is:

q(\mathbf{s}, \theta, \varphi|\Omega) = \prod_{s=1}^{S} q(\varphi_s|\tilde{\beta}_s) \prod_{d=1}^{M} q(\theta_d|\tilde{\alpha}_d) \prod_{t=1}^{N} q(s_{dt}|\tilde{\gamma}_{dt})

where \Omega = \{\tilde{\alpha}, \tilde{\beta}, \tilde{\gamma}\} are free variational parameters, \theta \sim \mathrm{Dirichlet}(\tilde{\alpha}), \varphi \sim \mathrm{Dirichlet}(\tilde{\beta}), and s_{dt} \sim \mathrm{Multinomial}(\tilde{\gamma}).


We can bound the objective function in Equation 8 in the following way:

O(D|\Lambda) \geq E_q[\log P(\mathbf{w}, \mathbf{s}, \theta, \varphi|\Lambda) - G(E_\Lambda[f(\mathbf{w}, \mathbf{s})])] - E_q[\log q(\mathbf{s}, \theta, \varphi)]

Letting L(\Omega; \Lambda) denote the RHS of the above equation, we have:

O(D|\Lambda) = L(\Omega; \Lambda) + \mathrm{KL}(q(\mathbf{s}, \theta, \varphi|\Omega)\,\|\,P(\mathbf{s}, \theta, \varphi|\Lambda))

Maximizing the lower bound L(\Omega; \Lambda) with respect to \Omega is thus the same as minimizing the KL distance between the variational posterior probability and the true posterior probability. Expanding the lower bound by using the factorizations of P and q, we have:

L(\Omega; \Lambda) = E_q[\log P(\theta|\alpha)] + E_q[\log P(\varphi|\beta)] + E_q[\log P(\mathbf{s}|\theta)] + E_q[\log P(\mathbf{w}|\mathbf{s}, \varphi)] - E_q[\log q(\varphi)] - E_q[\log q(\theta)] - E_q[\log q(\mathbf{s})] - E_q[G(E_\Lambda[f(\mathbf{w}, \mathbf{s})])] \quad (9)

We define \Delta(\mu) \equiv \log\Gamma(\sum_{k=1}^{K}\mu_k) - \sum_{k=1}^{K}\log\Gamma(\mu_k), where \Gamma is the Gamma function. Each of the eight terms in the above equation can then be expressed as:

L(\Omega; \Lambda) = \sum_{d=1}^{M}\Big(\Delta(\alpha) + \sum_{s=1}^{S}(\alpha_s - 1)\big(\Psi(\tilde{\alpha}_{d,s}) - \Psi(\textstyle\sum_{j=1}^{S}\tilde{\alpha}_{d,j})\big)\Big) \quad (10)

+ \sum_{s=1}^{S}\Big(\Delta(\beta) + \sum_{v=1}^{V}(\beta_v - 1)\big(\Psi(\tilde{\beta}_{s,v}) - \Psi(\textstyle\sum_{r=1}^{V}\tilde{\beta}_{s,r})\big)\Big) \quad (11)

+ \sum_{d=1}^{M}\sum_{t=1}^{N}\sum_{s=1}^{S}\tilde{\gamma}_{d,t,s}\big(\Psi(\tilde{\alpha}_{d,s}) - \Psi(\textstyle\sum_{j=1}^{S}\tilde{\alpha}_{d,j})\big) \quad (12)

+ \sum_{d=1}^{M}\sum_{t=1}^{N}\sum_{s=1}^{S}\sum_{v=1}^{V}\tilde{\gamma}_{d,t,s}\, w_{d,t,v}\big(\Psi(\tilde{\beta}_{s,v}) - \Psi(\textstyle\sum_{r=1}^{V}\tilde{\beta}_{s,r})\big) \quad (13)

- \sum_{s=1}^{S}\Big(\Delta(\tilde{\beta}_s) + \sum_{v=1}^{V}(\tilde{\beta}_{s,v} - 1)\big(\Psi(\tilde{\beta}_{s,v}) - \Psi(\textstyle\sum_{r=1}^{V}\tilde{\beta}_{s,r})\big)\Big) \quad (14)

- \sum_{d=1}^{M}\Big(\Delta(\tilde{\alpha}_d) + \sum_{s=1}^{S}(\tilde{\alpha}_{d,s} - 1)\big(\Psi(\tilde{\alpha}_{d,s}) - \Psi(\textstyle\sum_{j=1}^{S}\tilde{\alpha}_{d,j})\big)\Big) \quad (15)

- \sum_{d=1}^{M}\sum_{t=1}^{N}\sum_{s=1}^{S}\tilde{\gamma}_{d,t,s}\log\tilde{\gamma}_{d,t,s} \quad (16)

- E_q[G(E_\Lambda[f(\mathbf{w}, \mathbf{s})])] \quad (17)

where Ψ(·) is the digamma function, the first derivative of the log Γ(·) function. The first seven terms are the same as in the LDA model. We show how to


compute the last term in the above equation. For a sentiment label j,

E_q[G(E_\Lambda[f(\mathbf{w}, j)])] = E_q\Big[\sum_{w}\hat{f}_{jw}\log\frac{\hat{f}_{jw}}{E_\Lambda[f_{jw}(\mathbf{w}, j)]}\Big]

= \sum_{w}\hat{f}_{jw}\Big(\log\hat{f}_{jw} - E_q\Big[\sum_{d=1}^{M}\sum_{t=1}^{N_d}\log\big(P(w_{d,t}|s_{d,t}; \Lambda)\,\delta(s_{d,t} = j)\big)\Big]\Big)

= \sum_{w}\hat{f}_{jw}\Big(\log\hat{f}_{jw} - \sum_{d=1}^{M}\sum_{t=1}^{N_d}\tilde{\gamma}_{d,t,s}\,\delta(s_{d,t} = j)\big(\Psi(\tilde{\beta}_{j,w}) - \Psi(\textstyle\sum_{r=1}^{V}\tilde{\beta}_{j,r})\big)\Big)

We then employ a variational expectation-maximization (EM) algorithm to estimate the variational parameters Ω and the model parameters Λ.

—(E-step): For each word, optimize the values of the variational parameters \Omega = \{\tilde{\alpha}, \tilde{\beta}, \tilde{\gamma}\} (a code sketch of these updates follows this list). The update rules are

\tilde{\alpha}_{d,s} = \alpha + \sum_{t=1}^{N_d}\tilde{\gamma}_{d,t,s} \quad (18)

\tilde{\beta}_{s,v} = \beta + \sum_{d=1}^{M}\sum_{t=1}^{N_d}\delta(w_{d,t} = v)\,\tilde{\gamma}_{d,t,s} \quad (19)

\tilde{\gamma}_{d,t,s} \propto \begin{cases} \exp\big(\Psi(\tilde{\alpha}_{d,s}) + (1 + \hat{f}_{s,w_{d,t}})(\Psi(\tilde{\beta}_{s,w_{d,t}}) - \Psi(\sum_{v}\tilde{\beta}_{s,v}))\big) & \text{for labeled features} \\ \exp\big(\Psi(\tilde{\alpha}_{d,s}) + \Psi(\tilde{\beta}_{s,w_{d,t}}) - \Psi(\sum_{v}\tilde{\beta}_{s,v})\big) & \text{otherwise} \end{cases} \quad (20)

—(M-step): To estimate the model parameters, we maximize the lower bound on the log likelihood with respect to the parameters Λ = {α, β}. There is no closed-form solution for α and β, so an iterative search algorithm is used to find the maximal values.
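A minimal sketch of the E-step updates (18)-(20) for a single document is given below; it assumes dense NumPy arrays for γ̃, α̃ and β̃, a set of lexicon (labeled-feature) word indices, and it leaves the corpus-level update of β̃ (Equation 19) and the M-step search to comments.

import numpy as np
from scipy.special import digamma

def e_step_document(doc, gamma_d, alpha_tilde_d, beta_tilde, f_hat, labeled, alpha):
    """doc: word indices; gamma_d: Nd x S; alpha_tilde_d: length S; beta_tilde: S x V;
    f_hat: S x V target expectations; labeled: set of lexicon word indices."""
    # E[log phi_{s,v}] under q: Psi(beta~_{s,v}) - Psi(sum_v beta~_{s,v})
    e_log_phi = digamma(beta_tilde) - digamma(beta_tilde.sum(axis=1, keepdims=True))
    for t, w in enumerate(doc):
        if w in labeled:                          # Equation 20, labeled-feature case
            log_g = digamma(alpha_tilde_d) + (1.0 + f_hat[:, w]) * e_log_phi[:, w]
        else:                                     # Equation 20, otherwise
            log_g = digamma(alpha_tilde_d) + e_log_phi[:, w]
        g = np.exp(log_g - log_g.max())           # exponentiate stably, then normalize
        gamma_d[t] = g / g.sum()
    alpha_tilde_d[:] = alpha + gamma_d.sum(axis=0)   # Equation 18
    # Equation 19 is a corpus-level update applied after all documents:
    # beta_tilde[s, v] = beta + sum of gamma_d[t, s] over all tokens whose word index is v.
    return gamma_d, alpha_tilde_d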

6. A TOY EXAMPLE

In this section, we illustrate the differences among LDA, LDA-DP, and LDA-GE by running a toy example with a vocabulary size of 10, 3 sentiment labels, and a sentiment lexicon containing only 5 words, out of which 3 are positive and 2 are negative. Figure 2 depicts the multinomial distributions P(w|s) under LDA, LDA-DP, and LDA-GE. It can be observed that LDA has difficulties in modeling words' polarities. For example, the positive words, Words 8 and 10, have roughly even probability of being generated from either the positive or the neutral sentiment label. In contrast, both LDA-DP and LDA-GE are able to capture words' prior polarities: Words 1 and 2 have much higher probabilities of being generated from the negative sentiment label. Likewise, Words 8, 9, and 10 are more likely to be generated from the positive sentiment label. For other words which do not have prior polarities, their sentiment labels are inferred from their co-occurrences with the words that do have prior polarities.


Fig. 2. Comparison of example multinomial distributions P(w|s) under LDA, LDA-DP, and LDA-GE for 10 words and 3 sentiment labels. P(w|s) for the different sentiment labels is depicted in different colors, with green denoting the positive sentiment label, red denoting negative, and blue denoting neutral. LDA-DP and LDA-GE incorporated word polarity priors from a sentiment lexicon containing 5 words only, with Words 1 and 2 bearing the negative polarity, and Words 8, 9 and 10 bearing the positive polarity.

Table I. Dataset and sentiment lexicon statistics.

Dataset        No. of docs   Corpus    Vocabulary   Matched polarity    Prior polarity
               (pos/neg)     size      size         words (pos/neg)     word coverage
English Corpora
Movie review   1000/1000     568,652   25,516       866/1625            17.56%
Books          1000/1000     281,953   14,823       680/1186            17.89%
DVDs           1000/1000     274,827   15,277       773/1190            18.25%
Electronics    1000/1000     171,113   7,354        312/453             11.19%
Kitchen        1000/1000     141,790   6,420        338/412             12.40%
Chinese Corpora
Mobile         1159/1158     129,097   8,535        338/324             6.38%
DigiCam        853/852       57,651    5,439        263/246             7.49%
MP3            390/389       32,914    4,090        239/167             7.13%
Monitor        341/342       40,854    4,482        277/246             7.72%

7. EXPERIMENTAL SETUP

We evaluated our proposed methods on both English and Chinese corpora. The English corpora comprise the Movie review data3 and the multi-domain sentiment (MDS) dataset4. The Movie review data consists of 1000 positive and 1000 negative movie reviews downloaded from the IMDB movie archive. The MDS dataset contains four different types of product reviews extracted from Amazon.com including Books, DVDs, Electronics and Kitchen appliances, with 1000 positive and 1000 negative reviews for each domain.

3 http://www.cs.cornell.edu/People/pabo/movie-review-data/
4 http://www.cs.jhu.edu/~mdredze/datasets/sentiment/
5 http://www.informatics.sussex.ac.uk/users/tz21/dataZH.tar.gz
6 http://product.it168.com


The four Chinese corpora5 were derived from product reviews harvested from the website IT1686, with each corresponding to a different type of product review, including mobile phones, digital cameras, MP3 players, and monitors [Zagibalov and Carroll 2008a]. All the reviews were tagged by their authors as either positive or negative overall. It can be seen that the Movie review data appears to be the largest dataset, with nearly double the corpus size of Books and DVDs. The Electronics and Kitchen datasets are smaller, with their vocabulary sizes being only half of those of Books and DVDs. The sizes of the Chinese corpora are much smaller compared to the English datasets.

The English MPQA subjectivity lexicon7 and the Chinese NTU Sentiment Dictionary (NTUSD)8 [Ku and Chen 2007] were used to extract sentiment prior knowledge for the English and Chinese datasets respectively. It should be noted that both lexicons are domain-independent and do not bear any domain-specific information about the corpora used here.

Preprocessing was performed on the English datasets by first removing punctuation, numbers, non-alphabet characters and stopwords, and then stemming words to their root form. Chinese word segmentation was performed on the four Chinese corpora using the conditional random fields based Chinese Word Segmenter9 [Tseng et al. 2005]. Summary statistics of the datasets after preprocessing and the total number of matched polarity words (words that can be found in the corresponding sentiment lexicon) for each dataset are shown in Table I. The prior polarity word coverage in each corpus is also listed in the last column of the table.

8. EXPERIMENTAL RESULTS

This section presents the experimental results obtained using the LDA-DP and LDA-GE models tested on both the English and Chinese corpora. The results are averaged over five runs with different random initialization.

8.1 Comparison with Baseline Models

We compare our proposed approaches with the two baseline methods described below:

—Lexicon labeling. We implemented a baseline model which simply assigns a score of +1 and -1 to any matched positive and negative word respectively based on a sentiment lexicon. A review document is then classified as either positive or negative according to the aggregated sentiment score. Thus, in this baseline model, a document is classified as positive if there are more positive words than negative words in the document and vice versa (a code sketch follows this list).

—LDA. We evaluated sentiment classification performance with the LDA model where the number of topics was set to 3, corresponding to the 3 sentiment labels.

7 http://www.cs.pitt.edu/mpqa/
8 http://nlg18.csie.ntu.edu.tw:8080/opinion/pub1.html
9 http://nlp.stanford.edu/software/stanford-chinese-segmenter-2008-05-21.tar.gz
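The Lexicon labeling baseline described above reduces to a few lines; the following hedged sketch assumes tokenized reviews and two word sets taken from the lexicon, and breaks ties in favor of the negative class only for concreteness.

def lexicon_label(tokens, pos_words, neg_words):
    """Classify a review by aggregating +1/-1 scores of matched lexicon words."""
    score = sum(+1 for w in tokens if w in pos_words) + sum(-1 for w in tokens if w in neg_words)
    return "positive" if score > 0 else "negative"   # ties fall to "negative" here (a choice)

# Example: lexicon_label("the plot was excellent but the sound was poor".split(),
#                        {"excellent", "great"}, {"poor", "bad"})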


Table II. Overall comparison in sentiment classification accuracy (%). The boldface numbers indicate the best results obtained on each dataset.

Dataset        Lexicon Labeling   LDA     LDA-DP   LDA-GE
English Corpora
Movie review   66.90              55.82   71.15    72.67
Books          64.50              49.00   65.25    67.91
DVDs           65.00              51.80   71.16    71.39
Electronics    62.70              58.51   70.36    70.30
Kitchen        64.70              54.84   67.17    70.13
Average        64.76              53.99   69.02    70.46
Chinese Corpora
Mobile         72.57              64.23   77.03    77.05
DigiCam        53.04              54.01   81.32    74.38
MP3            76.02              63.62   78.13    78.08
Monitor        64.96              50.28   83.36    82.00
Average        66.65              58.04   80.09    77.88

Table II shows the classification accuracy results on both the English and Chinese corpora. It can be observed that Lexicon labeling achieves an accuracy in the range of 53-76%, with the best accuracy obtained using the Chinese NTUSD lexicon on the Chinese MP3 corpus. The LDA model without incorporating any sentiment prior information performs quite poorly, with its accuracy being only slightly better than random classification. A significant improvement is observed when the prior sentiment knowledge is incorporated. In general, LDA-GE performs better than all the other models on the English corpora except Electronics, with the best accuracy of 72.67% being achieved on the Movie review data. As for the Chinese corpora, LDA-DP appears better than all the other models, with the best accuracy of 83.36% being achieved on the Monitor corpus.

8.2 Comparison of LDA-DP and LDA-GE

We performed significance tests and found that while both LDA-DP and LDA-GE perform significantly better than LDA at the 0.01 significance level, there is no statistically significant difference between LDA-DP and LDA-GE. We thus conducted another set of experiments to further investigate the impact of the prior information on model learning. We compare LDA-DP and LDA-GE under two different settings. One is random initialization; that is, the word prior sentiment information is only incorporated by transforming the Dirichlet prior for the topic-word distribution in LDA-DP or used to modify the LDA objective function in LDA-GE. The other is initialization with prior information; that is, the word prior polarity information obtained from a sentiment lexicon is incorporated during the initialization stage of LDA model learning. Each word token in the corpus is compared against the words in a sentiment lexicon. A matched word token gets assigned its prior sentiment label; otherwise, it is assigned a randomly selected sentiment label.

The results under these two settings for LDA-DP and LDA-GE on both the English and Chinese corpora are reported in Table III. It can be observed that with random initialization, LDA-DP only gives mediocre results, with most of the accuracies being only slightly better than random classification. LDA-GE appears better than LDA-DP, with an average of 63% achieved across both the English and Chinese corpora. When initialized with word sentiment prior information, both LDA-DP and LDA-GE improve dramatically, with an average sentiment classification accuracy of 70% being achieved for the English corpora and nearly 80% for the Chinese corpora.
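As an illustration of the "initialization with prior information" setting just described, the sketch below assigns every matched token its prior label and a random label otherwise; the corpus and lexicon representations are assumptions of this example.

import random

def init_sentiment_labels(docs, lexicon, S=3, seed=0):
    """docs: list of token lists; lexicon: {word: prior sentiment label index}."""
    rng = random.Random(seed)
    # matched tokens take their prior label; all other tokens get a random label in [0, S)
    return [[lexicon.get(w, rng.randrange(S)) for w in doc] for doc in docs]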


Table III. Comparison of LDA-DP and LDA-GE in sentiment classification accuracy (%).

Dataset        LDA-DP         LDA-DP            LDA-GE         LDA-GE
               random init    init with prior   random init    init with prior
English Corpora
Movie review   64.00          71.15             65.87          72.67
Book           54.20          65.25             63.02          67.31
DVD            50.13          71.16             63.89          71.93
Electronics    50.13          70.36             61.12          70.28
Kitchen        55.12          67.17             60.93          70.13
Average        54.72          69.02             62.97          70.46
Chinese Corpora
Mobile         51.37          77.03             62.99          77.05
DigiCam        53.01          81.32             55.62          74.38
MP3            69.03          78.13             68.32          78.08
Monitor        64.04          83.36             62.99          82.00
Average        59.36          80.09             62.48          77.88

We also plot classification accuracies versus training iterations for both LDA-DP and LDA-GE on the English and Chinese corpora, as shown in Figure 3 and Figure 4 respectively. It can be observed from the figures that with random initialization, the performance of both LDA-DP and LDA-GE improves with the increase of training iterations. For LDA-DP, the improvement on the English corpora is in the range of 2-4%, which is less significant compared to the Chinese corpora, where an increase of 13-20% has been observed. LDA-GE improves over the initial classification results by 10-16% on the English corpora and 5-12% on the Chinese corpora.

When initialized with prior polarity information, the accuracies obtained using LDA-DP appear to be best at the beginning and either fluctuate in a small range of 1-2% or drop slightly before stabilizing again, as on the English Book corpus or the Chinese MP3 corpus. It seems that when LDA-DP is initialized with word sentiment prior information, the effect of transforming the Dirichlet prior of the topic-word distribution diminishes. Yet for LDA-GE initialized with prior information, a similar trend is observed as with random initialization: the performance improves with the increasing number of EM iterations and converges at about the 6th iteration for most of the corpora. We also notice that for some of the Chinese corpora, the performance of LDA-GE goes through a peak and then dips slightly before convergence. This is because the variational Bayes procedure does not maximize the classification accuracy directly; instead, it maximizes the log-likelihood of the data, and convergence is monitored based on the changes of a loss function rather than accuracy. In general, LDA-GE, which employs a variational Bayes algorithm, exhibits a faster convergence rate than LDA-DP, which uses collapsed Gibbs sampling. This is consistent with what has been observed by other researchers: the stochastic nature of collapsed Gibbs sampling causes it to converge more slowly than deterministic algorithms [Asuncion et al. 2009].

We also compare the run time of LDA-DP and LDA-GE versus LDA. For LDA and LDA-DP, we run 2000 Gibbs sampling iterations on all the datasets. The Dirichlet prior on the per-document topic distributions, α, was initially set to L × 0.05/S, where L is the average document length and S is the total number


of sentiment labels, and was re-estimated from the data using maximum-likelihood estimation [Minka 2003] every 40 Gibbs sampling iterations. For LDA-GE, the model runs until convergence in the EM procedure or until it reaches a maximum of 200 EM iterations. For all the datasets tested here, LDA-GE converges in less than 30 EM iterations.

Fig. 3. Comparison of the performance on the English corpora: (a) LDA-DP random init; (b) LDA-DP init with prior; (c) LDA-GE random init; (d) LDA-GE init with prior.

Figure 5 shows the average run time over five different runs for each model on each dataset, using a computer with a dual-core 2.8GHz CPU and 2GB memory. It can be observed that LDA-DP and LDA have roughly the same processing time, but LDA-GE runs significantly faster, using on average less than half of the run time of either LDA or LDA-DP.

The above results suggest that incorporating sentiment prior knowledge by augmenting the model objective function through adding generalized expectation criteria terms is more effective than modifying the Dirichlet prior for the topic-word distribution. Thus, LDA-GE should be preferred over LDA-DP when considering employing the topic model for sentiment analysis.

8.3 Comparison with Existing Approaches

Li et al. [2009] employed lexical prior knowledge extracted from a sentiment lexicon that was developed in the IBM India Research Labs [Ramakrishnan et al. 2003] for semi-supervised sentiment classification based on non-negative matrix tri-factorization. Such domain-independent prior knowledge was incorporated in conjunction with domain-dependent unlabeled data and a few labeled documents for model learning.



Fig. 4. Comparison of the performance on the Chinese corpora: (a) LDA-DP random init; (b) LDA-DP init with prior; (c) LDA-GE random init; (d) LDA-GE init with prior.


Fig. 5. Runtime comparison of different models.

With 10% of labeled documents for training, the non-negative matrix tri-factorization approach performed much worse than our approach, with a difference of 7%-11% for LDA-DP and 10%-13% for LDA-GE on both the movie review data and MDS. With 40% labeled documents, their approach gives a similar result on the movie review data to LDA-GE, which uses no labeled documents.

Lin and He [2010] incorporated sentiment prior information extracted from both the MPQA subjectivity lexicon and the appraisal lexicon10 by modifying

10 http://lingcog.iit.edu/arc/appraisal_lexicon_2007b.tar.gz


conditional probabilities used in Gibbs sampling during LDA model learning. The best sentiment classification results they obtained are 74.1% on the movie review data and 69.3% on MDS. Our LDA-GE performs slightly worse than theirs on the movie review data, but gives a better result on MDS.

Dasgupta and Ng [2009] proposed a weakly-supervised sentiment classification algorithm where user feedback is provided during the spectral clustering process in an interactive manner to ensure that documents are clustered along the sentiment dimension. Users are allowed to specify the dimension along which they want the data points to be clustered by inspecting a small number of words. They removed words that occur in only a single review as well as the top 1.5% of words after sorting the vocabulary by document frequency; we did not perform such preprocessing. Their proposed approach achieved 70.9% classification accuracy on the movie review data and an average of 68.95% on the MDS dataset. Both our LDA-DP and LDA-GE give slightly better performance on both datasets.

8.4 Domain-Specific Polarity-Bearing Words

While a generic sentiment lexicon provides useful prior knowledge for sentiment analysis, the contextual polarity of a word may be quite different from its prior polarity. Positive words may appear in sentences describing negative sentiment, and vice versa. Also, the same word might have a different polarity in different domains. For example, the word "small" is positive when used to describe a mobile phone, but it is negative if it is used to describe an SUV. Thus, it is worthwhile to automatically distinguish between prior and contextual polarity. Our proposed approach starts with a generic sentiment lexicon and incorporates the word sentiment prior knowledge into model learning. It is able to learn the sentiment-word probabilities from a particular domain and thereby reflect a domain-specific sentiment polarity for each word.

Table IV lists some extra polarity words extracted by LDA-GE which are not found in either the MPQA subjectivity lexicon for the English corpora or the NTUSD sentiment lexicon for the Chinese corpora. We can see that LDA-GE is able to identify domain-specific polarity words, for example, oscar for movie reviews, plai for Books, classic for DVDs, cordless for Electronics, and stainless and drip for Kitchen. From the Chinese corpora, example domain-specific words include (compact) for mobile phones, (telephoto) for digital cameras, (metallic) and (noise) for MP3 players, and (flat screen) and (distortion) for monitors. The iterative approach proposed in [Zagibalov and Carroll 2008a] can also automatically acquire polarity words from data. However, it appears that only positive words were identified by their approach. Our proposed LDA-GE model can extract both positive and negative words, and most of them are highly domain-salient, as can be seen from Table IV.


9. CONCLUSIONS

This paper has proposed two different ways of incorporating sentiment prior knowledge into the topic model for weakly-supervised sentiment analysis where sentiment labels are considered as topics. Prior information from sentiment lexicons is incorporated either by modifying the Dirichlet prior for the topic-word distribution (LDA-


Table IV. Extracted example polarity words by LDA-GE.

Corpus       Pos/Neg  Extracted Polarity Words
English Corpora
Movie rev.   Pos      classic, comedi, detail, dramat, emot, oscar, ryan, strong, touch
             Neg      alien, attack, dark, gui, hard, scream, sex, thriller, violenc
Books        Pos      art, clear, detail, easi, insight, profession, simpl, specif, valu
             Neg      crime, dark, dead, kill, lost, plai, poor, sex, slow, wrong
DVDs         Pos      art, bonu, classic, comedi, easi, famili, highli, special, top, won
             Neg      alien, budget, dark, dull, low, murder, poor, sex, stuck, wast
Electronics  Pos      bright, clear, cordless, design, easi, high, light, loud, small, top
             Neg      break, dead, error, fix, junk, plai, repair, replac, return, wast
Kitchen      Pos      deal, durabl, easili, large, light, safe, sharp, soft, solid, stainless
             Neg      break, cost, drip, hot, leak, loud, plastic, poor, smell, wrong
Chinese Corpora (English glosses)
Mobile       Pos      (not bad; pretty good), (user-friendly), (fashionable), (easy to use), (compact), (comfortable), (thin; light), (bluetooth), (strong; strength), (easy)
             Neg      (bad), (poor), (clash), (slow), (no; not), (difficult; hard), (less), (repair)
DigiCam      Pos      (simple), (shake reduction), (advantage), (compact), (fashionable), (strong; strength), (telephoto), (dynamic), (comprehensive), (professional)
             Neg      (regret), (bad), (poor), (return; refund), (slow), (dark), (expensive), (difficult; hard), (consume much electricity), (plastic), (repair)
MP3          Pos      (outstanding), (compact), (comprehensive), (simple), (strong; strength), (beautiful), (textual), (metallic), (not bad; pretty good)
             Neg      (noise), (consume much electricity), (poor), (bad), (short), (expensive), (substandard), (crash), (no), (but)
Monitors     Pos      (professional), (in focus), (fashionable), (concise), (energy efficient), (flat screen), (not bad; pretty good), (comfortable), (looks bright), (sharp)
             Neg      (deformation), (blurred), (serious; severe), (distortion), (color cast bad), (bad), (poor), (leakage of light), (black screen), (dark), (jitter)


DP), or by augmenting the model objective function through adding terms that express preferences on expectations of sentiment labels of the lexicon words using generalized expectation criteria (LDA-GE). Experimental results on both the English and Chinese corpora show that our approaches attain comparable or better performance than existing weakly-supervised sentiment classification methods despite using no labeled documents. Moreover, the proposed approaches are simple and robust and do not require careful parameter tuning. We also found that LDA-GE appears to be more effective than LDA-DP and thus should be preferred when considering employing the topic model for sentiment analysis. Although this paper primarily studies sentiment analysis, the proposed approach is applicable to any text classification task where some relevant prior knowledge is available.

One issue relating to the proposed approach is that it still depends on the existence of a language-specific sentiment lexicon and thus cannot be applied to other less-studied languages where no sentiment-related resources are available. A possible way to alleviate this problem is to construct a language-specific sentiment lexicon automatically from data and use it as the prior information source to be incorporated into model learning. Another promising direction for future work is to incorporate ontology engineering into weakly-supervised model learning. By incorporating domain-independent knowledge from a sentiment lexicon as well as domain knowledge from ontologies, we hope to reveal both the topics and the sentiment labels of a document simultaneously.

REFERENCES

Andreevskaia, A. and Bergler, S. 2008. When specialists and generalists work together: Overcoming domain dependence in sentiment tagging. In Proceedings of the Association for Computational Linguistics and the Human Language Technology Conference (ACL-HLT). 290–298.

Asuncion, A., Welling, M., Smyth, P., and Teh, Y. W. 2009. On smoothing and inference for topic models. In Proceedings of the International Conference on Uncertainty in Artificial Intelligence (UAI). 27–34.

Blei, D. and McAuliffe, J. 2008. Supervised topic models. Advances in Neural Information Processing Systems (NIPS) 20, 121–128.

Blei, D. M., Ng, A. Y., and Jordan, M. I. 2003. Latent Dirichlet allocation. Journal of Machine Learning Research 3, 993–1022.

Blitzer, J., Dredze, M., and Pereira, F. 2007. Biographies, bollywood, boom-boxes and blenders: Domain adaptation for sentiment classification. In Proceedings of the Association for Computational Linguistics (ACL). 440–447.

Choi, Y., Cardie, C., Riloff, E., and Patwardhan, S. 2005. Identifying sources of opinions with conditional random fields and extraction patterns. In Proceedings of the Human Language Technology Conference and Conference on Empirical Methods in Natural Language Processing (HLT/EMNLP). 355–362.

Dasgupta, S. and Ng, V. 2009. Topic-wise, sentiment-wise, or otherwise? Identifying the hidden dimension for unsupervised text classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 580–589.

Druck, G., Mann, G., and McCallum, A. 2008. Learning from labeled features using generalized expectation criteria. In Proceedings of the 31st Annual International ACM SIGIR Conference (SIGIR). 595–602.

Griffiths, T. and Steyvers, M. 2004. Finding scientific topics. Proceedings of the National Academy of Sciences of the United States of America 101, Suppl 1, 5228–5235.

He, Y. 2011. Latent sentiment model for weakly-supervised cross-lingual sentiment classification. In Proceedings of the 33rd European Conference on Information Retrieval (ECIR). 214–225.

Kim, S.-M. and Hovy, E. 2004. Determining the sentiment of opinions. In Proceedings of the International Conference on Computational Linguistics (COLING). 1367–1373.

Ku, L. and Chen, H. 2007. Mining opinions from the Web: Beyond relevance retrieval. Journal of the American Society for Information Science and Technology 58, 12, 1838–1850.

Lacoste-Julien, S., Sha, F., and Jordan, M. 2008. DiscLDA: Discriminative learning for dimensionality reduction and classification. In Advances in Neural Information Processing Systems (NIPS).

Li, T., Zhang, Y., and Sindhwani, V. 2009. A non-negative matrix tri-factorization approach to sentiment classification with lexical prior knowledge. In Proceedings of the Joint Conference of the 47th Annual Meeting of the Association for Computational Linguistics and the 4th International Joint Conference on Natural Language Processing (ACL-IJCNLP). 244–252.

Lin, C. and He, Y. 2009. Joint sentiment/topic model for sentiment analysis. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM). 375–384.

Lin, C., He, Y., and Everson, R. 2010. A comparative study of Bayesian models for unsupervised sentiment detection. In Proceedings of the 14th Conference on Computational Natural Language Learning (CoNLL). 144–152.

McCallum, A., Mann, G., and Druck, G. 2007. Generalized expectation criteria. Tech. Rep. 2007-60, University of Massachusetts Amherst.

Melville, P., Gryc, W., and Lawrence, R. D. 2009. Sentiment analysis of blogs by combining lexical knowledge with text classification. In Proceedings of the 15th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). 1275–1284.

Mimno, D. and McCallum, A. 2008. Topic models conditioned on arbitrary features with Dirichlet-multinomial regression. In Proceedings of the 24th Annual Conference on Uncertainty in Artificial Intelligence (UAI). 411–418.

Minka, T. 2003. Estimating a Dirichlet distribution. Tech. rep., MIT.

Narayanan, R., Liu, B., and Choudhary, A. 2009. Sentiment analysis of conditional sentences. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 180–189.

Pang, B. and Lee, L. 2004. A sentimental education: sentiment analysis using subjectivity summarization based on minimum cuts. In Proceedings of the Annual Meeting of the Association for Computational Linguistics (ACL). 271–278.

Pang, B., Lee, L., and Vaithyanathan, S. 2002. Thumbs up?: sentiment classification using machine learning techniques. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 79–86.

Qiu, L., Zhang, W., Hu, C., and Zhao, K. 2009. Selc: a self-supervised model for sentiment classification. In Proceedings of the 18th ACM Conference on Information and Knowledge Management (CIKM). 929–936.

Ramage, D., Hall, D., Nallapati, R., and Manning, C. 2009. Labeled LDA: A supervised topic model for credit attribution in multi-labeled corpora. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 248–256.

Ramakrishnan, G., Jadhav, A., Joshi, A., Chakrabarti, S., and Bhattacharyya, P. 2003. Question answering via Bayesian inference on lexical relations. In Proceedings of the ACL 2003 Workshop on Multilingual Summarization and Question Answering. 1–10.

Read, J. and Carroll, J. 2009. Weakly supervised techniques for domain-independent sentiment classification. In Proceedings of the 1st International CIKM Workshop on Topic-Sentiment Analysis for Mass Opinion. 45–52.

Schapire, R., Rochery, M., Rahim, M., and Gupta, N. 2002. Incorporating prior knowledge into boosting. In Proceedings of the 19th International Conference on Machine Learning (ICML). 538–545.

Tan, S., Wang, Y., and Cheng, X. 2008. Combining learn-based and lexicon-based techniques for sentiment detection without using labeled examples. In Proceedings of the 31st Annual International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR). 743–744.


Tseng, H., Chang, P., Andrew, G., Jurafsky, D., and Manning, C. 2005. A conditional random field word segmenter. In Fourth SIGHAN Workshop on Chinese Language Processing. Vol. 37.

Turney, P. D. 2002. Thumbs up or thumbs down?: semantic orientation applied to unsupervised classification of reviews. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL). 417–424.

Wu, X. and Srihari, R. 2004. Incorporating prior knowledge with weighted margin support vector machines. In Proceedings of the 10th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). 326–333.

Zagibalov, T. and Carroll, J. 2008a. Automatic seed word selection for unsupervised sentiment classification of Chinese text. In Proceedings of the 22nd International Conference on Computational Linguistics (COLING). 1073–1080.

Zagibalov, T. and Carroll, J. 2008b. Unsupervised classification of sentiment and objectivity in Chinese text. In Proceedings of the 3rd International Joint Conference on Natural Language Processing (IJCNLP). 304–311.

Zhao, J., Liu, K., and Wang, G. 2008. Adding redundant features for CRFs-based sentence sentiment classification. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP). 117–126.
