CornPittMich Sentiment Slot-Filling System at TAC 2014

Xilun Chen∗, Arzoo Katiyar∗, Xiaoan Yan∗, Lu Wang∗, Carmen Banea∗∗, Yoonjung Choi∗∗∗, Lingjia Deng∗∗∗, Claire Cardie∗, Rada Mihalcea∗∗ and Janyce Wiebe∗∗∗

∗ Cornell University   ∗∗ University of Michigan   ∗∗∗ University of Pittsburgh
Abstract

We describe the 2014 system of the CornPittMich team for the KBP English Sentiment Slot-Filling (SSF) task. The central components of the architecture are two machine-learning-based fine-grained opinion analysis systems, each of which treats opinion extraction as a sequence-tagging task. For each query, we process every sentence in the document specified in the query to identify expressions of sentiment along with their sources and targets.
1 Introduction
This paper describes the 2014 system of the CornPittMich team for the KBP English Sentiment Slot-Filling (SSF) task. Our goal was to investigate the use of two new systems for fine-grained analysis of opinionated text: the CRF- and ILP-based opinion analysis system of Yang and Cardie (2013) and the bi-directional deep recurrent neural network method of İrsoy and Cardie (2014).

Each of our four submitted systems employs the same four-step architecture to process the (single) document associated with each query. The architecture is illustrated in Figure 1. First, we apply one or both opinion extraction systems to every sentence in the document, searching for directly stated opinions, i.e., Direct Subjective Expressions (DSEs), and opinions expressed indirectly, i.e., Expressive Subjective Expressions (ESEs), as defined in Wiebe et al. (2005). We note that the CRF- and ILP-based system, henceforth ILPExtractor, jointly extracts the opinion holders and targets. In contrast, the bi-directional deep recurrent neural network, henceforth BDRNN, extracts only the opinion expression. As a result, we need to recover the opinion holders and targets for BDRNN. This is accomplished in Step 2, which relies on the dependency paths between the extracted expressions and the candidate named entities in the sentence. Step 3 classifies the polarity of each extracted opinion expression. In the final step, we remove responses that do not match the query (e.g., the direction or polarity stated in the query), combine the results from each system (if both opinion systems were applied), and remove duplicates. We then convert the output into the required format, adjusting (if needed) the justification spans and slot-fillers.

We find that the system that combines the results from both ILPExtractor and BDRNN produces the best overall scores, achieving a maximum F1 score of 0.1281. In addition, we find that BDRNN exhibits higher recall and ILPExtractor, higher precision.

Below we first describe in more detail the architecture of our system (Section 2) and each of its components. We then present our results along with an initial analysis of system errors (Section 3).
2 System Architecture
The high-level system architecture of the CornPittMich (CPM) system is shown in Figure 1. Given a query, we apply the opinion analysis components to the document given by the query to identify potential slot-fillers (the opinion expressions and their respective holders and targets) associated with the opinion query entity. Since each of our two opinion analysis systems extracts only a subset of the necessary components for the slot fillers, the extracted potential fillers are later augmented and further filtered in subsequent phases. In the following subsections, we provide a short description of each component of the system.

Figure 1: Overall System Architecture. [Figure: a four-step pipeline. Step 1: opinion extraction (ILPExtractor, BDRNN) over the documents, parse trees, and NE annotations, followed by pre-matching against the query; Step 2: heuristics using parse trees to add opinion holders and targets, followed by a second pre-match; Step 3: refining the justification span and adding polarity; Step 4: ensemble (combining the systems, removing duplicates, and converting the output format).]

2.1 Preprocessing

2.1.1 Resolving Named Entities with the Knowledge Base
To optimize query-time access to the large corpus released as part of the KBP SSF task, we sought to resolve each SERIF named entity (and SERIF-identified coreferent mentions) to its unique knowledge base entry.

More specifically, a query in the KBP SSF task is composed of a named entity (i.e., the query entity) and a sentiment relation (positive/negative-from or positive/negative-toward). The answer to the query should consist of a named entity that occurs in the prescribed sentiment relation with respect to the query entity. In the majority of queries, the named entity is also accompanied by a knowledge base (KB) identifier that uniquely denotes the entity. During preprocessing, we seek to resolve all NEs and coreferent mentions to their entry in the KB. With proper indexing, this allows us to retrieve at evaluation time only those documents that contain information pertaining to the query entity.

Difficulties stem from the non-canonical representation of entities in natural text, largely caused by their orthographic realization or alternative representations. To alleviate this problem, we mine alternative mention representations from Wikipedia article annotations and implement a voting mechanism that allows the most likely resolutions to surface. In particular, we first mine the entities’ surface forms using Wikipedia as a corpus. This is accomplished by leveraging the Wikipedia editors’ annotations, which resolve mentions to Wikipedia articles via a framework of internal hyperlinks. Based on these statistics, the second step consists of computing the likelihood that a mention resolves to a particular knowledge base entry. Finally,
the last step involves processing the actual SERIF annotations and providing a list of potential knowledge base resolutions for each named entity. We explain the latter in more detail below.

Given a document, the SERIF-formatted annotations provide a set of mentions for each named entity. For example, for a named entity we may encounter the following mentions: “Barack Obama,” “he,” “the president,” “B. Obama,” “Barack Hussein Obama,” etc. Each of these mentions sharpens the profile of the referenced entity, allowing these non-canonical representations to cast their votes for the most likely knowledge base resolutions, by employing the Wikipedia resolution likelihoods computed earlier. In the end, for each named entity that we were able to resolve, we provide a list of possible knowledge base resolutions, with candidates ranked from strongest to weakest.
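The voting step can be sketched as follows. The data structures here, a per-surface-form likelihood table mined from Wikipedia hyperlinks and a flat list of SERIF mentions, are hypothetical simplifications of the actual pipeline.

```python
from collections import defaultdict

def rank_kb_resolutions(mentions, link_likelihood):
    """Rank candidate KB entries for one entity by summing, over all of its
    mentions, the Wikipedia-derived likelihood that the mention's surface
    form links to each entry.  `link_likelihood` maps a surface form to a
    dict of {kb_entry: probability}; unknown forms simply cast no vote."""
    votes = defaultdict(float)
    for mention in mentions:
        for kb_entry, prob in link_likelihood.get(mention, {}).items():
            votes[kb_entry] += prob
    # Strongest candidate first.
    return sorted(votes, key=votes.get, reverse=True)

likelihood = {
    "Barack Obama": {"KB:Barack_Obama": 0.95, "KB:Obama_Japan": 0.05},
    "the president": {"KB:Barack_Obama": 0.30, "KB:George_W_Bush": 0.25},
    "B. Obama": {"KB:Barack_Obama": 0.90},
}
mentions = ["Barack Obama", "the president", "B. Obama", "he"]
print(rank_kb_resolutions(mentions, likelihood))
# KB:Barack_Obama accumulates the most votes and ranks first.
```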
2.1.2 Indexing and Parsing the Corpus
We used Apache’s Lucene-based Solr search platform to index both the corpora and the KBP knowledge base itself. The corpora are searchable by keyword and document ID, as well as via any metadata provided along with the raw text (such as author). The latter is helpful for retrieving discussion forum posts for which the query entity is the author.

Since parse trees are utilized in various steps of our system, we compute and cache the parse tree when we process a document for the first time. The documents are parsed using the Stanford Parser, augmented with the original character offsets for each token so that subsequent tokenization transformations can be inverted. Ultimately, for each retrieved document, a search returns raw text, structured text, parse trees for sentences, and sentences as ordered lists of tokens.

In addition, some sanitization was done to strip out HTML artifacts such as angle brackets and other invalid characters. Furthermore, to produce text that would be easy for the sentiment analysis systems to process, metadata blocks such as author and sender information, as well as other sections irrelevant to sentiment analysis, were stripped out before handing the text over for analysis. We also removed all content surrounded by <quote> tags in the discussion forum corpus, which corresponds to quotations of previous posts. The quoted content has already been seen elsewhere in the thread, so there is no need to keep these duplicates. This turned out to be a significant speed boost on the discussion forum corpus, since the vast majority of its content consists of duplicate quotes.
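The quote removal can be done with a simple pattern over the raw post text. A minimal sketch, assuming non-nested <quote> blocks:

```python
import re

def strip_quotes(post_text):
    """Remove <quote>...</quote> blocks (quoted earlier posts) from a
    discussion forum post; the quoted text already appears in the thread,
    so keeping it would only produce duplicate extractions."""
    return re.sub(r"<quote>.*?</quote>", "", post_text, flags=re.DOTALL)

post = "<quote>I think he played well.</quote>Agreed, he was brilliant."
print(strip_quotes(post))   # Agreed, he was brilliant.
```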
2.2 Opinion Extraction
The documents retrieved for each query are processed by one, or both, of the sentiment analysis systems, BDRNN and ILPExtractor. The goal of these systems is to identify opinion expression tuples from each document: [holder, expression, target, polarity]. These are filtered in a post-processing step that removes duplicates and retains only those slot-fillers that match the query specifications.
2.2.1 ILPExtractor
The ILPExtractor system uses a Conditional Random Field (CRF) (Lafferty et al., 2001) and Integer Linear Programming (ILP) based opinion extraction system (Yang and Cardie, 2013) for within-sentence identification of subjective expressions, opinion targets, and (possibly implicit) opinion holders. The system is trained on the MPQA corpus (Wiebe et al., 2005) and models a sentence as a sequence of segments by relaxing the Markov assumption of classical CRFs, in turn allowing the incorporation of segment-level labels (Yang and Cardie, 2012). For the KBP SSF task, we identify only Direct Subjective Expressions, e.g. “criticized”, “like”, “pit X against Y”. Integer Linear Programming is used to coordinate the construction of opinion relations from the set of possible subjective expressions, targets, and holders. This component establishes the connections between expression-holder and expression-target annotations.

The ILPExtractor can simultaneously extract opinion expressions and their holders and targets, if any. However, it does not assign polarity to the extracted opinions. In order to match the polarity specified in the query, we add polarity to the extracted opinions in a later step.
2.2.2 BDRNN
The BDRNN system uses bi-directional deep recurrent neural networks for the task of opinion expression extraction, formulated as a token-level sequence-labeling task (İrsoy and Cardie, 2014). Recurrent neural networks can operate on sequential data of variable length and hence can be applied as sequence labelers; bi-directional RNNs additionally incorporate information from the preceding as well as the following tokens. With deep RNNs, we stack multiple hidden layers such that lower layers capture short-term interactions among words and higher layers reflect interpretations aggregated over longer spans of text. We train our model on the MPQA corpus and identify both Direct Subjective Expressions (DSEs) and Expressive Subjective Expressions (ESEs) as defined in Wiebe et al. (2005).

The BDRNN system extracts many more opinions than the ILPExtractor, providing higher recall. It does not, however, give information on holders, targets, or polarity. Therefore, we develop various post-processing steps to recover this information in order to output slot fillers matching the query.
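In the token-level formulation, the labeler emits one tag per token. A hypothetical sketch of decoding such token-level BIO tags (e.g. B-DSE, I-ESE, O) back into opinion expression spans:

```python
def bio_to_spans(tokens, tags):
    """Collect (label, phrase) spans from token-level BIO tags, the
    output format of a sequence labeler such as the BDRNN."""
    spans, start, label = [], None, None
    for i, tag in enumerate(tags + ["O"]):   # sentinel flushes the last span
        if (tag.startswith("B-") or tag == "O") and start is not None:
            spans.append((label, " ".join(tokens[start:i])))
            start, label = None, None
        if tag.startswith("B-"):
            start, label = i, tag[2:]
        elif tag.startswith("I-") and start is None:   # tolerate stray I- tags
            start, label = i, tag[2:]
    return spans

tokens = ["The", "coach", "praised", "the", "players"]
tags   = ["O", "O", "B-DSE", "O", "O"]
print(bio_to_spans(tokens, tags))   # [('DSE', 'praised')]
```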
2.3 Pre-Matching with Queries
In the opinion analysis step, our two systems extract all opinion expressions from the document, regardless of their relevance to the query. Each query specifies three requirements for a sentiment slot filler: the opinion holder, target, and polarity. Failing to match any one of the three yields an invalid filler. Therefore, most of the opinions extracted in the previous step turn out to be irrelevant to the query, which may result in a significant amount of futile work in the following steps. We hence design two Pre-Matching phases against the query to rule out irrelevant opinions based on what we already know about the filler.

The first PreMatch happens right after opinion extraction. Since at this point BDRNN has not provided information on holders or targets, and neither system has assigned polarities to the opinions, we can only prune the fillers by their opinion expressions. The idea is to use (for now) the sentence in which the opinion expression occurs as the justification span. If the query entity (or one of its coreferring mentions) is not mentioned anywhere in that sentence, the extracted opinion is considered irrelevant and removed from the filler candidates. This PreMatch phase turned out to be very effective: for the evaluation queries, only 13,974 fillers remained out of the 101,846 extracted by BDRNN.
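A minimal sketch of this first PreMatch pass, assuming a simple substring test over hypothetical (sentence, expression) pairs; the actual system uses SERIF coreference chains rather than raw string matching:

```python
def prematch(extracted, query_mentions):
    """First PreMatch pass: keep only opinion expressions whose containing
    sentence (the provisional justification span) mentions the query
    entity or one of its coreferring mentions."""
    kept = []
    for sentence, expression in extracted:
        if any(mention in sentence for mention in query_mentions):
            kept.append((sentence, expression))
    return kept

opinions = [
    ("Mosimane praised the players.", "praised"),
    ("The weather was terrible.", "terrible"),
]
print(prematch(opinions, {"Mosimane", "Pitso Mosimane"}))
# Only the first opinion survives; the second never mentions the query entity.
```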
Similarly, we apply another PreMatch after recovering holders and targets (Section 2.4), where we can further remove those fillers whose opinion holder or target fails to match the query specification.
2.4 Adding Opinion Holders and Targets
As explained in Section 2.2, BDRNN does not extract any opinion holders or targets, and ILPExtractor sometimes fails to do so. We hence devise heuristics over dependency parse trees to recover the missing holders and targets. For the discussion forum corpus, where author information exists for a given post, the author is also used as a potential opinion holder. For ILPExtractor, the recovered holder/target may sometimes duplicate or even conflict with the originally extracted ones, but we simply keep both, since later post-processing steps remove duplicates and resolve conflicts (Section 2.7). Similarly, when multiple holders/targets are recovered, or the holder recovered by the heuristics conflicts with the author, both are kept and the conflict is resolved in subsequent stages.

To recover holder and target information for a given opinion expression, we first generate a pool of potential candidates, drawn from all the named entities and their coreferring mentions in the sentence. We further populate the candidate pool based on the query. For instance, for a “pos-from” query, the query entity should be the target of the sentiment. We then let the entities that corefer with the query entity be the opinion target candidates, and the remaining entities in the sentence be the holder candidates. Then, for each potential holder/target, the dependency path between it and the opinion expression is derived using a dependency parser and checked against our heuristics:

Opinion Holder: The dependency path must contain subj and end with one of subj, nn, or pass.

Opinion Target: The dependency path must contain obj or amod and end with one of subjpass, nn, obj, amod, poss, or ccomp.

After applying these heuristics, we augment the extracted fillers from the Opinion Extraction step with the newly recovered holders and targets and pass them to the next step.
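The two rules can be expressed directly over the dependency relation labels along the path. A sketch, where the path representation (a list of labels such as ['nsubj'] or ['dobj', 'prep']) is an assumed simplification of the parser output:

```python
def matches_holder(path):
    """Holder heuristic: the dependency path must contain 'subj' and
    end with one of 'subj', 'nn', or 'pass' (e.g. nsubj, nsubjpass)."""
    return any("subj" in rel for rel in path) and \
        path[-1].endswith(("subj", "nn", "pass"))

def matches_target(path):
    """Target heuristic: the path must contain 'obj' or 'amod' and end
    with one of 'subjpass', 'nn', 'obj', 'amod', 'poss', or 'ccomp'."""
    return any("obj" in rel or "amod" in rel for rel in path) and \
        path[-1].endswith(("subjpass", "nn", "obj", "amod", "poss", "ccomp"))

print(matches_holder(["nsubj"]))   # True:  "Mosimane praised ..."
print(matches_target(["dobj"]))    # True:  "... praised the players"
print(matches_holder(["dobj"]))    # False: no subj relation on the path
```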
2.5 Refining Justification Spans
As mentioned in the previous sections, we have used the entire sentence where the opinion expression occurs as the justification span, which may contain a lot of extraneous information. In addition, there is a length requirement for justification spans (at most 4 spans of ≤ 150 characters each), so a justification must be divided into multiple spans whenever it exceeds 150 characters, even if it is not otherwise very long. Therefore, in this step, the justification spans are refined and split into multiple spans as necessary. The main idea is to take the constituent parse of the sentence and keep only the shortest clausal constituent that covers the opinion expression as well as its holder and target. Afterwards, if the justification is still longer than 150 characters, some heuristics are applied to split the response.
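The final length check can be sketched as a greedy word-level splitter; the clausal-constituent selection itself requires the constituent parse and is omitted here. A single token longer than 150 characters would still overflow, which the heuristics in the actual system would need to handle separately:

```python
def split_justification(justification, max_len=150, max_spans=4):
    """Greedily pack words into spans of at most `max_len` characters,
    returning at most `max_spans` spans as required by the task."""
    spans, current = [], ""
    for word in justification.split():
        candidate = word if not current else current + " " + word
        if len(candidate) > max_len and current:
            spans.append(current)
            current = word
        else:
            current = candidate
    if current:
        spans.append(current)
    return spans[:max_spans]

spans = split_justification("word " * 80)   # 400 characters of text
print(len(spans), max(len(s) for s in spans))
```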
2.6 Assigning Polarity to Extracted Opinions
To assign polarity to the extracted opinions, we simply use a Naive Bayes classifier with unigrams, bigrams, and the lexicons used in Wilson et al. (2005) as features. We train two classifiers, positive vs. rest and negative vs. rest, on the dataset used in Socher et al. (2013). We first run the two classifiers on the extracted opinion expression. If the polarities assigned by the two classifiers differ, we assign the polarity based on the posterior probability values. If, however, the two classifiers agree on the rest class, we run the classifiers on the entire sentence and assign the polarity extracted from the sentence to the opinion phrase.
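The decision procedure can be sketched as follows; the classifier interface, each model returning a (label, posterior) pair, is a hypothetical simplification of the trained Naive Bayes models.

```python
def assign_polarity(pos_clf, neg_clf, expression, sentence):
    """Combine the positive-vs-rest and negative-vs-rest classifiers.
    Each classifier returns a (label, posterior) pair, with labels
    'pos'/'rest' and 'neg'/'rest' respectively."""
    def decide(text):
        pos_label, pos_p = pos_clf(text)
        neg_label, neg_p = neg_clf(text)
        if pos_label == "pos" and neg_label == "neg":
            # The two classifiers disagree: trust the higher posterior.
            return "positive" if pos_p >= neg_p else "negative"
        if pos_label == "pos":
            return "positive"
        if neg_label == "neg":
            return "negative"
        return None  # both chose the rest class

    polarity = decide(expression)
    if polarity is None:
        # Back off to the whole sentence when the expression is inconclusive.
        polarity = decide(sentence)
    return polarity

# Stub classifiers standing in for the trained Naive Bayes models.
pos_clf = lambda t: ("pos", 0.9) if "praised" in t else ("rest", 0.6)
neg_clf = lambda t: ("neg", 0.8) if "criticized" in t else ("rest", 0.7)
print(assign_polarity(pos_clf, neg_clf, "praised", "He praised them."))
# positive
```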
2.7 Query Match and Response Ensemble
In this final post-processing step, we combine the results from the two systems by removing duplicates and merging conflicting outputs, along with some other post-processing procedures such as canonicalization of named entities.

First, a final matching against the query is done, using all three criteria: holder, target, and polarity. For example, if the query type is “pos-from”, then we filter out all the negative opinions and consider only the opinions whose targets are the query entity, according to the named entity coreference information.

We then remove the duplicates and resolve the conflicts in our produced responses. We detect duplicate opinions extracted by the ILPExtractor and BDRNN systems and return only one of them. First, the polarities of any pair of duplicate opinions must be the same. Then, if the offsets of the holders, the targets, and the opinion expressions overlap, we assume that the two opinions are the same and report only one of them. If the two opinions are from different sentences but both the holders and the targets refer to the same entity according to the named entity coreference annotations, then we report the two opinions on one line with their offsets delimited by “,”, as required in the task description.

Some additional post-processing is also done in this step, such as finding the canonical mention of the filler entity, which is done by utilizing the NE and coreference information.
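The duplicate test can be sketched as follows, assuming each opinion is represented as a dict of character-offset (start, end) spans plus a polarity; this representation is hypothetical.

```python
def spans_overlap(a, b):
    """Overlap test for half-open character-offset spans (start, end)."""
    return a[0] < b[1] and b[0] < a[1]

def is_duplicate(op1, op2):
    """Two extracted opinions count as duplicates when their polarities
    agree and the offsets of holder, target, and expression all overlap."""
    if op1["polarity"] != op2["polarity"]:
        return False
    return all(spans_overlap(op1[key], op2[key])
               for key in ("holder", "target", "expression"))

a = {"polarity": "positive", "holder": (0, 14), "expression": (15, 22),
     "target": (23, 60)}
b = {"polarity": "positive", "holder": (5, 14), "expression": (15, 22),
     "target": (30, 55)}
print(is_duplicate(a, b))   # True
```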
3 Results and Analysis

3.1 Submissions
We submitted four sets of results to the Sentiment Slot-Filling task. We describe our four submissions below.
3.1.1 Submission 1: ILPExtractor + BDRNN
We simply combined the results from the two opinion extraction methods: we ran each method individually through the pipeline and submitted the combined result by taking a simple union and removing duplicates. We output 1147 slot responses for the 400 queries. The results from this run are summarized below.

Recall: (117+0) / (682+0) = 0.171
Precision: (117+0) / 1147 = 0.102
F1: 0.1279
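The scores above follow the standard precision/recall/F1 computation over raw counts, as a quick sketch shows (note that 117/682 rounds to 0.172, slightly different from the truncated 0.171 shown above):

```python
def prf1(correct, total_gold, returned):
    """Recall, precision, and F1 from raw counts: correct responses,
    total gold answers, and total returned responses."""
    recall = correct / total_gold
    precision = correct / returned
    f1 = 2 * precision * recall / (precision + recall)
    return recall, precision, f1

recall, precision, f1 = prf1(117, 682, 1147)
print(round(recall, 3), round(precision, 3), round(f1, 4))
# 0.172 0.102 0.1279
```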
3.1.2 Submission 2: BDRNN
In the pipeline described in Section 2, we use only the opinion expressions extracted by the BDRNN method. In this case, we output 1081 slot responses for the 400 queries. The results from this run are summarized below.
Table 1: Sample Responses

Response 1
Query: Pitso Mosimane XIN_ENG_20101012.0266 225238 PER pos-towards
Sentence: South Africa’s national soccer coach Pitso Mosimane praised players Itumeleng Khune and Teko Modise Tuesday for their performances against Sierra Leone over the weekend.
Expression: praised; Holder: Pitso Mosimane; Target: players Itumeleng Khune and Teko Modise; Polarity: positive

Response 2
Query: Cam Newton NYT_ENG_20101205.0049 174183 PER pos-from
Sentence: “We couldn’t stop him,” South Carolina coach Steve Spurrier said, adding, “He’s almost a one-man show.”
Expression: almost a one-man show; Holder: He (coreferent with Cam Newton); Target: him; Polarity: positive

Response 3
Query: Kate Middleton WPB_ENG_20101116.0031 8396 PER pos-from
Sentence: “Kate is a formidable character; she’s an intelligent, modern, middle-class girl who is not easily fazed,” Robert Jobson, author of the book “William’s Princess,” said in an interview.
Expression: intelligent, modern; Holder: NIL; Target: NIL; Polarity: positive

Response 4
Query: Luiz Inacio Lula da Silva AFP_ENG_20101220.0621 214238 PER pos-from
Sentence: Asked if he would risk his favorable legacy, Carvalho said “that would be a risk ... but Lula would come back in a favorable position.”
Expression: would come back in a favorable; Holder: Carvalho; Target: his; Polarity: negative
Recall: (112+0) / (682+0) = 0.164
Precision: (112+0) / 1081 = 0.103
F1: 0.1270

3.1.3 Submission 3: ILPExtractor

In the pipeline described in Section 2, we find the opinion expressions using the ILPExtractor method, with the added advantage that this method also outputs the opinion holders and targets, so we rely comparatively less on the heuristics to find them. In this case, we output 194 slot responses for the 400 queries. This method has low recall, as can be seen below, but the highest precision among our submissions.

Recall: (28+0) / (682+0) = 0.041
Precision: (28+0) / 194 = 0.144
F1: 0.063
3.1.4 Submission 4: ILPExtractor + BDRNN + NoPolarity

This submission is similar to our first submission, except that we skip the pipeline step in which we assign polarity to the extracted opinions; we instead assume that the extracted opinion expressions have the same polarity as given in the query. As one would expect, we output the highest number of slot responses in this case: 1877 for the 400 queries. The results from this run are summarized below.

Recall: (164+0) / (682+0) = 0.240
Precision: (164+0) / 1877 = 0.087
F1: 0.1281

Thus, our fourth submission has the highest recall and also the highest F1 score among our submissions, but it suffers with respect to precision.
3.2 Analysis
For the 400 evaluation queries in our best run, we output 1877 slot responses corresponding to 252 queries; the remaining 148 queries have NIL responses. Since ours is a pipeline approach, we analyzed the output after each step of the pipeline to find possible reasons for missing responses. In Table 1, we provide some examples of extractions from our pipeline.

The 1st response is an example of a correct response extracted by our system. In the 2nd response, however, we were able to extract the expression and polarity correctly but could not extract the correct holder and target, so this response was later removed and a NIL response returned for the query. In some cases, such as the 3rd response, we could not extract any holder or target using the heuristics, likewise leading to a final NIL response. In the 4th response, we extract an incorrect polarity and hence return a NIL final response for this query.

Thus, we find that we were able to extract the opinion expression for most queries, but we still suffered from incorrect extraction of holders and targets: the heuristics described earlier miss many cases. This points to the possibility of jointly extracting opinion expressions along with the opinion holders and targets. In a few responses, we find that our naive polarity classifier is also responsible for incorrect responses, so we would also like to extract polarity jointly in the models described above for opinion expression extraction.
4 Conclusions
We described our 2014 system for the KBP English Sentiment Slot-Filling (SSF) task. We use two main methods for extracting opinion expressions, ILPExtractor and BDRNN, both trained on the MPQA corpus. We find that the BDRNN system has higher recall while ILPExtractor has higher precision, and that combining the results from the two systems performs best. We used heuristics to find the opinion holders and targets and find that they are a potential bottleneck for the performance of our system. Our most important goal for next year is to improve the extraction of opinion holders and targets by building a joint model for fine-grained opinion mining that, along with finding the opinion expression and its polarity, also finds the opinion holders and targets.
References

Ozan İrsoy and Claire Cardie. 2014. Opinion mining with deep recurrent neural networks. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP).

John Lafferty, Andrew McCallum, and Fernando C. N. Pereira. 2001. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the Eighteenth International Conference on Machine Learning (ICML).

Richard Socher, Alex Perelygin, Jean Y. Wu, Jason Chuang, Christopher D. Manning, Andrew Y. Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing (EMNLP).

Janyce Wiebe, Theresa Wilson, and Claire Cardie. 2005. Annotating expressions of opinions and emotions in language. Language Resources and Evaluation, 39(2):165–210.

Theresa Wilson, Janyce Wiebe, and Paul Hoffmann. 2005. Recognizing contextual polarity in phrase-level sentiment analysis. In Proceedings of HLT/EMNLP, pages 347–354.

Bishan Yang and Claire Cardie. 2012. Extracting opinion expressions with semi-Markov conditional random fields. In Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, pages 1335–1345, Jeju Island, Korea, July. Association for Computational Linguistics.

Bishan Yang and Claire Cardie. 2013. Joint inference for fine-grained opinion extraction. In Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1640–1649. Association for Computational Linguistics.