Strong Baselines for Cross-Lingual Entity Linking

Valentin I. Spitkovsky, Angel X. Chang
Computer Science Department, Stanford University, Stanford, CA, 94305
Google Research, Google Inc., Mountain View, CA, 94043
{valentin, angelx}@{cs.stanford.edu, google.com}

Abstract

We describe several context-independent baselines for tackling the cross-lingual entity linking task. Our methods are quite basic, reducing to efficient look-ups in static, pre-computed tables. Despite their simplicity, however, such approaches scored well in a recent knowledge-base population competition. Moreover, these language-independent techniques still perform strongly on English entity linking tasks.

Introduction

The entity linking task — as defined in the Knowledge-Base Population (KBP) tracks at the Text Analysis Conference (TAC) — is a challenge to associate string mentions in documents with articles in a knowledge base (KB). In the two earliest TAC-KBPs, the KB was a subset of the English Wikipedia, and the documents were also in English (McNamee and Dang, 2009; Ji et al., 2010). In 2011, the conference's organizers created a new, cross-lingual track, in which mentions and documents could be in English or in Chinese, although the KB still remained English-centric.

Somewhat surprisingly, context-independent techniques developed by Stanford-UBC — which ignore the documents and focus on just the mention of each query — have managed to score above the median entries in all previous English entity linking evaluations (Agirre et al., 2009; Chang et al., 2010; Chang et al., 2011). At the core of that approach were several static, English-specific dictionaries for mapping short strings of natural language text to canonical article titles from the English Wikipedia. Since the dictionaries were English-specific, unmodified look-up methods led to below-median performance on the cross-lingual entity linking task (Chang et al., 2011). We will show how such dictionaries can be improved — using conceptually simple modifications — to again score above the median.

New and Improved Components

We will now describe several key dictionary components and our improvements over the original Stanford-UBC dictionary from 2009 (which was reused in 2010 and 2011).

Remapper

The remapper attempts to group the various English Wikipedia titles that, in fact, refer to the same article, by mapping them to a canonical URL (Agirre et al., 2009, §2.1). This year, we improved the original remapper in several ways, the most important of which was disallowing the merging of two clusters if both of them contain an entry from the KB. Other improvements had to do with better handling of KB entries whose Wikipedia pages are now redirects, and with preferential treatment of URLs that start with upper-case characters among non-KB pages. This component remained English-specific.
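To make the merging constraint concrete, here is a minimal union-find sketch; the class, its helper names, and the toy titles are our own illustration, not the actual Stanford-UBC implementation:

```python
# Sketch of the remapper's merging constraint (illustrative, not the
# authors' code): clusters of Wikipedia titles are merged with a simple
# union-find, except that a merge is refused whenever both clusters
# already contain a knowledge-base (KB) entry.

class Remapper:
    def __init__(self, kb_titles):
        self.parent = {}                  # union-find parent pointers
        self.has_kb = {}                  # does this cluster hold a KB entry?
        self.kb_titles = set(kb_titles)

    def find(self, title):
        self.parent.setdefault(title, title)
        self.has_kb.setdefault(title, title in self.kb_titles)
        if self.parent[title] != title:   # path compression
            self.parent[title] = self.find(self.parent[title])
        return self.parent[title]

    def merge(self, a, b):
        ra, rb = self.find(a), self.find(b)
        if ra == rb:
            return True                   # already the same cluster
        if self.has_kb[ra] and self.has_kb[rb]:
            return False                  # disallowed: both clusters hold KB entries
        self.parent[rb] = ra
        self.has_kb[ra] = self.has_kb[ra] or self.has_kb[rb]
        return True

remapper = Remapper(kb_titles={"Stanford_University", "Google"})
remapper.merge("Stanford", "Stanford_University")     # allowed: one KB entry
ok = remapper.merge("Stanford_University", "Google")  # refused: both in KB
```

The KB-aware refusal is what keeps two distinct KB entities from collapsing into a single canonical URL, even when redirects or renamed pages suggest a link between them.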

Cross-Mapper

The cross-mapper is a new, multi-lingual component which groups together all Wikipedia articles corresponding to the same English counterpart, by mapping them to the canonical English Wikipedia URL. It too is English-centric — as is the KB — since it ignores clusters of parallel Wikipedia articles that are not available in English.

GOOG Dictionary

The GOOG "dictionary" disambiguates a string by querying the Google search engine in English (hl=en), with the site:en.wikipedia.org directive, scoring any returned URLs beginning with http://en.wikipedia.org/wiki/ using the inverses of their ranks (Agirre et al., 2009, §2.4). Our first language-independent baseline is a simple modification of GOOG, which drops hl=en, relaxes the restriction to just site:wikipedia.org, and keeps not only the English but now also any non-English Wikipedia pages covered by the cross-mapper; scores for canonical English pages hit multiple times are simply added. We used our new multi-lingual dictionaries in the same way as for our English-specific entity linking submission, by focusing on highest-scoring entries, with a simple NIL-clustering strategy (Chang et al., 2011, §1, §4.1).[1] Although it is easy to implement, the GOOG dictionary offers a fairly weak baseline, scoring some two-and-a-half points below the median entry in this year's competition (see Table 1).

2011 System   Dictionary           KB MicroAve   B³ F1
Stanford2-3   GOOG                 69.7          65.0
Stanford2-1   English EXCT→LNRM    71.4          66.0
median                                           67.5
Stanford2-2   EXCT→LNRM            74.5          69.5
highest                                          78.8

Table 1: Stanford2 results for cross-lingual entity linking.
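The inverse-rank scoring of the modified GOOG dictionary can be sketched as follows; the cross-mapper is modeled as a plain dictionary, and the ranked result list, URLs, and function names are illustrative stand-ins rather than real search output:

```python
# Sketch of the modified GOOG dictionary's scoring (illustrative data):
# a Wikipedia URL returned at rank r contributes 1/r; non-English pages
# are first mapped to canonical English counterparts via the cross-mapper,
# and scores landing on the same canonical page are summed.

from collections import defaultdict

# Hypothetical cross-mapper: non-English article -> canonical English URL.
CROSS_MAP = {
    "http://es.wikipedia.org/wiki/Universidad_Stanford":
        "http://en.wikipedia.org/wiki/Stanford_University",
}

def score_results(ranked_urls):
    scores = defaultdict(float)
    for rank, url in enumerate(ranked_urls, start=1):
        if url.startswith("http://en.wikipedia.org/wiki/"):
            canonical = url                 # already an English page
        elif url in CROSS_MAP:
            canonical = CROSS_MAP[url]      # non-English page we can map
        else:
            continue                        # unmappable page: dropped
        scores[canonical] += 1.0 / rank     # inverse-rank vote
    return dict(scores)

results = [
    "http://en.wikipedia.org/wiki/Stanford_University",   # rank 1
    "http://es.wikipedia.org/wiki/Universidad_Stanford",  # rank 2
    "http://en.wikipedia.org/wiki/Stanford,_California",  # rank 3
]
scores = score_results(results)
# Stanford_University collects 1/1 + 1/2 = 1.5 after cross-mapping.
```

Summing the mapped votes is what lets non-English search hits reinforce the correct English KB candidate instead of being discarded.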

[1] This time, however, we correctly assigned a unique NIL identifier to each distinct string mention (strategy N1), instead of accidentally making each NIL unique to the query (strategy N2).
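The difference between the two NIL-clustering strategies can be sketched in a few lines; the query strings and identifier format are illustrative:

```python
# Sketch of the two NIL-clustering strategies from the footnote
# (query data and identifier format are illustrative): N1 assigns one
# NIL identifier per distinct mention string, so repeated strings share
# a NIL cluster; N2 (the earlier accident) makes every query its own NIL.

def nil_ids(queries, per_query=False):
    ids, seen = [], {}
    for i, mention in enumerate(queries):
        key = i if per_query else mention   # N2 keys on query position
        if key not in seen:
            seen[key] = "NIL%04d" % len(seen)
        ids.append(seen[key])
    return ids

queries = ["ABC Corp", "ABC Corp", "XYZ Ltd"]
n1 = nil_ids(queries)                   # ['NIL0000', 'NIL0000', 'NIL0001']
n2 = nil_ids(queries, per_query=True)   # three distinct identifiers
```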

English EXCT→LNRM Dictionary

Next, we made several improvements to our core English dictionary, which is based primarily on the anchor-texts of web-links: both internal inter-Wikipedia links and external links from the web into the English Wikipedia (Agirre et al., 2009, §2.2). First, we created a new view of external, non-Wikipedia links into the English Wikipedia, according to the August 2nd, 2011 Google web crawl. And second, we introduced a number of additional relevant boolean features to augment the raw counts (in the end, we did not use these features in our submission).[2] Thus, our new, still-monolingual dictionary contained more and fresher string-to-article mappings than the original version from 2009.

We used the refreshed English dictionary in our standard cascading way, going by exact matches (EXCT) whenever possible and falling through to more forgiving matching (LNRM) if needed (Chang et al., 2011, §2.1). This new monolingual dictionary scored one point higher than GOOG, but still one-and-a-half points lower than the median entry, in the cross-lingual evaluation (see Table 1).

Cross-Lingual EXCT→LNRM Dictionary

Finally, we created the cross-lingual dictionary by incorporating a new kind of information: anchor-texts from non-Wikipedia web-pages into non-English-Wikipedia pages covered by our cross-mapper. This gave us a stream of indirect web-counts, as if their anchor-texts had come from direct links to the corresponding canonical English Wikipedia pages. To counter-balance this additional weight of web-links, we also created a new view of inter-English-Wikipedia links, according to the same crawl, complementing the information from the 2008/9 Wikipedia dumps that closely resemble the KB but may have now become stale.

Our new, multi-lingual dictionary performed significantly better than its monolingual counterpart, scoring two points higher than the median entry in the 2011 cross-lingual entity linking competition (see Table 1). We believe that it offers a surprisingly strong baseline, considering that it uses neither context nor any knowledge specific to Chinese.
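The EXCT→LNRM cascade can be sketched as two successive dictionary look-ups. The normalization shown (lowercasing and keeping only alphanumeric characters) is our plausible reading of the LNRM transformation, and the toy dictionaries are illustrative:

```python
# Sketch of the EXCT -> LNRM cascade (toy dictionaries): try an
# exact-match look-up first; only if it misses, retry with a more
# forgiving normalized key. Normalizing by lowercasing and keeping
# only alphanumeric characters is an assumption about LNRM, hedged
# in the lead-in above.

def lnrm(s):
    return "".join(c for c in s.lower() if c.isalnum())

EXCT = {"Stanford University": "Stanford_University"}
LNRM = {lnrm(k): v for k, v in EXCT.items()}

def link(mention):
    if mention in EXCT:                 # exact match wins outright
        return EXCT[mention]
    return LNRM.get(lnrm(mention))      # fall through to forgiving match

link("Stanford University")    # exact hit
link("STANFORD-UNIVERSITY!")   # LNRM hit after normalization
link("Gibberish")              # no match: candidate for NIL
```

Falling through only on a miss preserves the precision of exact matches while recovering recall for casing and punctuation variants.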

Further Monolingual Evaluation

To get a better sense of the multi-lingual dictionary's quality, we tested it on all three English evaluation sets, using both exact lookups and our usual cascade of dictionaries. The new dictionary scored well above the median and not far below the highest entry on the 2009 evaluation set (see Table 2a); higher than the highest entry that did not access Wikipedia pages associated with KB nodes in inference in 2010 (see Table 2b); and again higher than the median but lower than the highest entry among no-wiki-text submissions in 2011 (see Table 2c).[3] Exact lookups were slightly — but consistently — worse than our cascade strategy.

[2] We intend to explain all features in a sister paper (Spitkovsky and Chang, 2012) that is to accompany the public release of our cross-lingual dictionaries and newest associated components.

[3] Note that our approach would qualify as no-wiki-text, since it does not make use of the text of the Wikipedia article in question (though it may use anchor text from other Wikipedia pages).

a) 2009

                       KB MicroAve   B³ F1
median                    71.1        64.9
EXCT                      79.4        65.2
EXCT→LNRM                 79.5        63.5
highest                   82.2        68.4

b) 2010

                       KB MicroAve
no-wiki-text-median       77.9
median                    78.8
no-wiki-text-highest      79.5
EXCT                      82.3
EXCT→LNRM                 82.9
highest                   86.8

c) 2011

                       KB MicroAve   B³ F1
no-wiki-text-median                   52.1
EXCT                      70.0        67.4
EXCT→LNRM                 71.2        68.6
no-wiki-text-highest                  71.4
median                                71.6
highest                               84.6

Table 2: Results for all three English entity linking tasks.

Conclusions

We described Stanford's knowledge-base population system for the cross-lingual entity linking task. Our multi-lingual dictionary uses neither context nor language-specific knowledge, yet performs better than the median scorers on all available TAC-KBP entity linking evaluation sets. Despite its simplicity, the static dictionary presents a surprisingly strong baseline to the research community, as well as possibly a useful platform for developing more sophisticated context-sensitive, machine learning approaches. We are currently in the process of publicly releasing this resource and related data (Spitkovsky and Chang, 2012).

Acknowledgments

This work was carried out in the summer of 2011, while both authors were working full time at Google Inc., during the second author's internship. We would like to thank our advisors at Stanford University, Dan Jurafsky and Chris Manning, for their continued help and support. We are also grateful to the other members of the original Stanford-UBC TAC-KBP entity linking team, Eneko Agirre and Eric Yeh: our initial (monolingual) dictionary for mapping strings to Wikipedia articles was conceived and constructed during that collaboration, in the summer of 2009. We thank the task organizers for their effort.

References

E. Agirre, A. X. Chang, D. S. Jurafsky, C. D. Manning, V. I. Spitkovsky, and E. Yeh. 2009. Stanford-UBC at TAC-KBP. In TAC.
A. X. Chang, V. I. Spitkovsky, E. Yeh, E. Agirre, and C. D. Manning. 2010. Stanford-UBC entity linking at TAC-KBP. In TAC.
A. X. Chang, V. I. Spitkovsky, E. Agirre, and C. D. Manning. 2011. Stanford-UBC entity linking at TAC-KBP, again. In TAC.
H. Ji, R. Grishman, H. T. Dang, K. Griffitt, and J. Ellis. 2010. Overview of the TAC 2010 Knowledge Base Population track. In TAC.
P. McNamee and H. Dang. 2009. Overview of the TAC 2009 Knowledge Base Population track. In TAC.
V. I. Spitkovsky and A. X. Chang. 2012. A cross-lingual dictionary for English Wikipedia concepts. In LREC.
