Harnessing Linked Knowledge Sources for Topic Classification in Social Media

Amparo E. Cano (Knowledge Media Institute, The Open University, UK) [email protected]
Andrea Varga (Organisations, Information and Knowledge Group (OAK), The University of Sheffield, UK) [email protected]
Matthew Rowe (School of Computing and Communications, Lancaster University, UK) [email protected]
Fabio Ciravegna (OAK, The University of Sheffield, UK) [email protected]
Yulan He (School of Engineering and Applied Science, Aston University, UK) [email protected]

ABSTRACT

Topic classification (TC) of short text messages offers an effective and fast way to reveal events happening around the world, ranging from those related to disasters (e.g. Hurricane Sandy) to those related to violence (e.g. the Egyptian revolution). Previous approaches to TC have mostly focused on exploiting individual knowledge sources (KSs) (e.g. DBpedia or Freebase) without considering the graph structures that surround concepts present in KSs when detecting the topics of Tweets. In this paper we introduce a novel approach for harnessing such graph structures from multiple linked KSs, by: (i) building a conceptual representation of the KSs; (ii) leveraging contextual information about concepts by exploiting semantic concept graphs; and (iii) providing a principled way of combining KSs. Experiments evaluating our TC classifier in the context of violence detection (VD) and emergency response (ER) show promising results that significantly outperform various baseline models, including an approach using a single KS without linked data and an approach using only Tweets.

Keywords: linked knowledge sources, violence detection, emergency response, named entities, semantic concept graphs

1. INTRODUCTION

In recent years, social media have continued to grow in popularity and have become a powerful platform for people to unite under common interests. Twitter in particular has proven to be a faster channel of communication than traditional media, as seen during the Egyptian revolution and the 2011 Japan earthquake. Therefore the real-time identification of topics discussed in these channels could aid in different scenarios, including violence detection and emergency response situations.


However, this classification task poses different challenges, including: high topical diversity; irregular and ill-formed words; and, more importantly, the sparsity of Tweets' content along with the evolving jargon which emerges as different events are discussed. Recent research ([12, 9]) has proposed to alleviate the sparsity of microposts by leveraging existing social knowledge sources. In particular, a large body of work, which is discussed in the related work section, has addressed the task of topic classification of Tweets. However, the majority of these approaches only employ lexical features (e.g. bag of words (BoW) or bag of entities (BoE)) extracted solely from a Tweet's content. Other approaches classify Tweets into topics by enhancing a Tweet's feature set with features obtained from a single knowledge source (KS). Nevertheless, to our knowledge none of the existing approaches has leveraged the graph structures surrounding concepts present in a KS for the topical classification of Tweets.

In this paper we therefore propose a generic and unified framework for TC of Tweets using multiple linked KSs, and evaluate it in the violence detection (VD) and emergency response (ER) domains. In contrast to existing approaches, rather than focusing on lexical features derived from microposts, we propose a KS-based contextual enrichment of features. This enrichment is based on a technique we developed for deriving semantic meta-graphs from different KSs. Our approach leverages the entities appearing in a Tweet (e.g. Person, Location, Organisation) by exploiting additional contextual information about these entities' resources present in different KSs. From this information we derive semantic features which enhance the simple lexical feature representation of a Tweet. In previous work ([16]) we have shown that the performance of a topic classifier differs depending on the choice of the KS, arguing that different KSs may complement each other. Therefore in this work we investigate the benefit of combining and integrating the evidence of words and concepts from individual and linked KSs by following Linked Data principles. This approach results in the merging of additional semantic graphs derived from different knowledge spaces for the topic classification of Tweets.

The main research questions which we investigate are the following: (i) Do semantic meta-graphs built from KSs contain useful semantic features about entities for the topic classification (TC) of Tweets? To what extent do these semantic features help the violence detection (VD) and emergency response (ER) TC tasks?; and (ii) Which KS data and KS taxonomies (i.e. DBpedia and Yago or Freebase) provide more useful information for TC of Tweets?

The main contributions of this paper are as follows: (i) we propose and evaluate a novel set of semantic meta-graph features about entities for the TC of Tweets; (ii) we investigate different strategies for building TCs of Tweets, based on a sole knowledge source and on combined linked knowledge sources, and show the superiority of the latter approach; (iii) we propose a unified framework for harnessing the information and knowledge from multiple linked KSs for TC, showing its superiority over previous work using a sole KS and over classification using Twitter data only; and (iv) we compare the results of using different ontologies (DBpedia, Yago and Freebase) for deriving semantic features for TC of Tweets within the VD and ER tasks, and show that the combined mapped ontology provides the most accurate results.

2. MOTIVATION

Social knowledge sources constitute some of the largest repositories built in a collaborative manner, providing an up-to-date channel of information and knowledge on a large number of topics. The relevance of these KSs to Twitter is apparent due to the Social Web-based characteristics of KSs, including that: (i) they are constantly edited by Web users; (ii) their creation is done in a collaborative manner; and (iii) they cover a large number of topics. In this work we investigate the use of two KSs, namely DBpedia and Freebase.

DBpedia (http://dbpedia.org) is a KS derived from Wikipedia (http://wikipedia.org). In DBpedia [2], each resource is harvested from a Wikipedia article and is semantically structured according to the DBpedia (dbpedia, http://wiki.dbpedia.org/Ontology) and YAGO2 (yago, http://www.mpi-inf.mpg.de/yago-naga/yago/) ontologies, with the provision of links to external knowledge sources such as Freebase, OpenCyc (http://sw.opencyc.org/) and UMBEL (http://www.umbel.org/). The latest DBpedia dump, DBpedia 3.8, classifies 2.35 million resources into dbpedia ontological classes, according to 359 distinct classes which form a subsumption hierarchy and are described by 1,820 different properties. The yago ontology [8] is a much bigger and finer-grained ontology, containing 447 million facts about 9.8 million entities which are classified into 365,372 classes. In contrast, Freebase (freebase, http://freebase.org) is a large online knowledge base which users can edit in a similar manner to Wikipedia. In Freebase [3], resources are harvested from multiple sources such as Wikipedia, ChefMoz, NNDB and MusicBrainz (see http://sources.freebaseapps.com/), along with data individually contributed by users. These resources are semantically structured into Freebase's own ontologies, which consist of 1,450 classes and more than 7,000 unique properties. Overall, these ontologies (i.e. dbpedia, yago, freebase) enable a broad coverage of entities in the world, and allow entities to bear multiple overlapping types.

One of the main advantages of exploiting these KSs is that each particular topic (e.g. http://dbpedia.org/page/Category:Violence) is associated with a large number of resources, allowing one to build a broad representation of a topic. In addition, each resource is related to different ontological classes or concepts which provide additional contextual information for that resource, enabling in this way the exploitation of various semantic structures of these resources. The use of this structured knowledge enables the contextual enrichment of a Tweet's entities by providing information that can help to disambiguate the role of a given entity in a particular context. Consider the Tweets in Figure 1: although the entity Obama has different roles, such as president, Nobel laureate and husband, the role of this entity will be defined by the contextual information provided in the content of each Tweet. Section 4 introduces our approach for leveraging this semantic contextual information by introducing the concept of semantic meta-graphs.

3. RELATED WORK

Previous research on exploiting KSs for TC of Tweets can be divided into two main strands: approaches that use local metadata and approaches that exploit the link structure of the KSs. In the first case, Genc et al. [6] proposed a latent semantic topic modelling approach which mapped each Tweet to the most similar Wikipedia articles based on lexical features extracted from Tweets' content only. Song et al. [15] mapped a Tweet's terms to the most likely resources in the Probase KS; these resources were used as additional features in a clustering algorithm which outperformed the simple BoW approach. Munoz et al. [11] proposed an unsupervised vector space model for detecting topics in Tweets in Spanish. They used syntactic features derived from PoS (part-of-speech) tagging, extracting entities using the Sem4Tags tagger ([5]) and assigning a DBpedia URI to those entities by considering the words appearing in the context of the entity inside the Tweets. Vitale et al. [17] proposed a clustering-based approach which augmented the BoW features with BoE features extracted using the Tagme system, which enriches a short text with Wikipedia links by pruning n-grams unrelated to the input text, showing significant improvement over the BoW features. Recently, we [16] studied the similarity between KSs and Twitter using both BoW and BoE features, showing that the DBpedia and Freebase KSs contain complementary information for TC of Tweets, with the lexical features achieving the best performance.

Focusing on the approaches exploiting the linked structure of KSs, Michelson et al. [9] proposed an approach for discovering Twitter users' topics of interest by first extracting and disambiguating the entities mentioned in a Tweet; a sub-tree of the Wikipedia category graph containing the disambiguated entity is then retrieved and the most likely topic is assigned. Milne et al. [10] also assigned resources to Tweets. In their approach they make use of Wikipedia as a knowledge source and consider a Wikipedia article as a concept; their task then is to assign relevant Wikipedia article links to a Tweet. They propose a machine learning approach which makes use of Wikipedia n-gram and Wikipedia link-based features. Xu et al. [18] proposed a clustering-based approach which linked terms inside Tweets to Wikipedia articles by leveraging Wikipedia's linking history and the terms' textual context to disambiguate the terms' meanings. Despite the success of existing approaches, the vast majority still exploits a single KS when detecting topics in Tweets.

Figure 1: Tweets exposing different contexts involving the same entity

Figure 2: Deriving a semantic meta-graph from multiple KSs

However, recent studies indicate that KSs contain complementary information ([16]). Furthermore, although existing approaches ([11, 16]) consider entities' metadata when detecting topics in Tweets, the information is constrained by the NER service used (e.g. OpenCalais, http://www.opencalais.com, or Tagme, http://tagme.di.unipi.it/), which often returns generic entity types ([14]), ignoring the more fine-grained semantic information described in external KSs. In contrast to previous work, we present an approach which exploits the semantic structure of multiple linked KSs to gauge the more fine-grained role of an entity in a specific topic by proposing the use of semantic meta-graphs.

4. FRAMEWORK FOR TOPIC CLASSIFICATION OF MICROPOSTS

The proposed approach for building a topic classifier for Tweets consists of four main stages: (i) dataset collection; (ii) dataset enrichment (of both the Tweets and the KS-derived datasets); (iii) semantic feature derivation; and (iv) building a topic classifier based on features derived from crossed sources; the architecture is depicted in Figure 3.

In the first stage, data collection, data from both Twitter and the KSs is retrieved. The Twitter dataset comprises a set of topically annotated Tweets. Conversely, the KS dataset is built from a set of articles relevant to a given topic extracted from multiple KSs. This study considers two KSs, namely DBpedia (DB) and Freebase (FB), which are applied both independently and merged. Therefore we consider three scenarios for the use of these KS datasets: (i) DB - from DBpedia only; (ii) FB - from Freebase only; and (iii) DB-FB - from both DBpedia and Freebase (see Section 5).

The second stage, dataset enrichment, performs two main steps: (i) entity extraction, relying on the OpenCalais and Zemanta (http://zemanta.com) services for named entity recognition; and (ii) semantic mapping, where the obtained named entities are mapped to their KS resource counterparts if these exist (following this process, the percentage of entities without a dereferenced URI is 35% in DBpedia, 40% in Freebase, and 36% in Twitter).

The third stage, semantic feature derivation, consists of leveraging the semantic information about the extracted entities within the different KSs. This stage comprises two steps: (i) semantic meta-graph construction and (ii) semantic feature augmentation, which are discussed in the following subsections.

Figure 3: Architecture of cross-source TC using semantic features. (The diagram shows the pipeline stages Retrieve Tweets / Retrieve Articles, Annotate Tweets, Concept Enrichment, Derive Semantic Features over the DB, FB, DB-FB and TW sources, and Build Cross-Source Topic Classifier.)

4.1 Deriving Semantic Meta-Graphs

The semantic mapping of a named entity into a KS resource incorporates a rich semantic representation. Figure 2 presents an extract of the semantic properties and classes for the entity "Barack Obama". In this work, rather than focusing on the instances associated with a resource, we focus on each triple's semantic structure at a meta-level, and for that we introduce the following meta-graph definition.

Definition 1 (Resource Meta Graph). A resource meta graph is a sequence of tuples G := (R, P, C, Y) where:
- R, P, C are finite sets whose elements are resources, properties, and classes;
- Y ⊆ R × P × C is a ternary relation representing a hypergraph with ternary edges.

The hypergraph of a Resource Meta Graph Y is defined as a tripartite graph H(Y) = ⟨V, D⟩ where the vertices are V = R ∪ P ∪ C and the edges are D = {{r, p, c} | (r, p, c) ∈ Y}.

A resource meta-graph provides information regarding the set of ontologies and properties used in the semantic definition of a given resource. The meta-graph of a given entity e can be represented as the sequence of tuples G(e) = (R, P, C, Y'), which is the aggregation of all resources, properties and classes related to this entity. This definition serves as a formal representation of the triples related to an entity, and enables building upon subgraphs of this graph. For example, we introduce two further notations: R(c) = {e_1, ..., e_n} for referring to the set of all entity resources whose rdf:type is class c; and R'(c) = {e_1, ..., e_m} for denoting the set of entity resources whose types are specialisations of c's parent type (i.e. resources whose rdf:type are siblings of c). This process results in a semantic concept graph which associates each entity with the corresponding semantic ontological classes (or concepts) involving this entity.

In light of the proposed three KS scenarios, we construct three different semantic meta-graphs: (i) one from DB using the dbpedia and yago ontologies; (ii) one from FB using the freebase ontology; and (iii) another one from DB-FB using the joint ontologies. For the joint scenario we use the concepts from the dbpedia ontology together with the classes obtained after mapping the yago and freebase ontologies. (The mapping of Freebase entity classes to the most likely yago classes was done by a combined element- and instance-based technique (www.l3s.de/~demidova/students/master_oelze.pdf) and is available at http://iqp.l3s.uni-hannover.de/yagof.html.)

Once a semantic meta-graph has been constructed for a given entity, two main features can be derived from it, namely the class and property features. These features provide additional contextual information regarding an entity and are described as follows:

C: Semantic class features. This feature set consists of all the classes appearing in the semantic meta-graph of a given entity, capturing fine-grained information about this entity. For example, for Barack Obama these features would be yago:PresidentsOfTheUnitedStates, freebase:/book/author, yago:LivingPeople, and dbpedia:Person. Our main intuition is that the relevance of an entity to a given topic can be inferred from the entity's class type. For example, the class yago:PresidentsOfTheUnitedStates could be considered more relevant to the topic "Violence" than the class yago:Singer.

P: Semantic class-property features. This feature set captures all the properties appearing in the semantic meta-graph of a given entity. Our intuition is that, given a context, certain properties of an entity can be more indicative of this entity's relevance to a topic than others. For example, given the role of Tahrir Square in the Egyptian revolution, properties such as dcterms:subject could be more topically informative than geo:geometry. The relevance of a property to a given topic can be derived from the semantic structure of a KS graph by considering the approach proposed in Subsection 4.2.
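To make the meta-graph construction concrete, the following sketch gathers the rdf:type classes and the properties of one mapped entity resource from the public DBpedia SPARQL endpoint. This is a minimal illustration, not the authors' code: the endpoint URL, the SPARQLWrapper library and the helper name are our assumptions.

```python
# Minimal sketch (assumed tooling): collect the classes and properties that
# populate the semantic meta-graph of a single entity resource in DBpedia.
from SPARQLWrapper import SPARQLWrapper, JSON

ENDPOINT = "http://dbpedia.org/sparql"  # assumed public endpoint

def entity_meta_graph(resource_uri):
    """Return (classes, properties) observed for one KS resource."""
    sparql = SPARQLWrapper(ENDPOINT)
    sparql.setReturnFormat(JSON)

    # C features: dbpedia/yago classes attached to the resource via rdf:type
    sparql.setQuery(f"SELECT DISTINCT ?c WHERE {{ <{resource_uri}> a ?c }}")
    rows = sparql.query().convert()["results"]["bindings"]
    classes = {r["c"]["value"] for r in rows}

    # P features: properties used to describe the resource
    sparql.setQuery(f"SELECT DISTINCT ?p WHERE {{ <{resource_uri}> ?p ?o }}")
    rows = sparql.query().convert()["results"]["bindings"]
    props = {r["p"]["value"] for r in rows}
    return classes, props

classes, props = entity_meta_graph("http://dbpedia.org/resource/Barack_Obama")
```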

4.2 Weighting Semantic Features

In order to capture the relative importance of each feature in a semantic meta-graph, we propose two different weighting strategies, based on the generality and specificity of a feature in a given semantic meta-graph.

W-Freq: Semantic Feature Frequency. A lightweight approach for weighting the ontological class and property features enhancing the feature space of a document (i.e. KS article or Tweet) x is to consider all the semantic meta-graphs extracted from the entity resources appearing in this document. We define the frequency of a semantic feature f in a given document x with Laplace smoothing as follows:

$$ SFF_x(f) = \frac{N_x(f) + 1}{|F| + \sum_{f' \in F} N_x(f')} \qquad (1) $$

where N_x(f) is the number of times feature f appears in all the semantic meta-graphs associated with document x, and F is the semantic features' vocabulary. This weighting function captures the relative importance of a document's semantic features against the rest of the corpus, while the normalisation prevents bias towards longer documents.
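As a worked illustration of Equation 1, a minimal sketch of the W-Freq weighting over one document's aggregated meta-graph features (the function and variable names are ours):

```python
from collections import Counter

def sff_weights(doc_semantic_features, vocabulary):
    """Laplace-smoothed semantic feature frequency, Equation 1.

    doc_semantic_features: list of class/property features aggregated over the
    semantic meta-graphs of the entities appearing in document x.
    vocabulary: the full semantic feature vocabulary F.
    """
    counts = Counter(doc_semantic_features)
    denom = len(vocabulary) + sum(counts.values())
    # the +1 smoothing gives every feature in F a small non-zero weight
    return {f: (counts[f] + 1) / denom for f in vocabulary}
```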

While the W-Freq (semantic feature frequency) weighting function depends on the occurrences of features in a particular document, other, more generalised weighting information can be derived from a KS's semantic structure to characterise a semantic meta-graph. To derive a weighted semantic meta-graph we propose the following W-SG weighting strategy.

W-SG: Class-Property Co-Occurrence Frequency. The rationale behind this novel weighting strategy is to model the relative importance of a property p (e.g. dbpediaOwl:ground) to a given class c (e.g. yago:MiddleEasternCountries), together with the generality of the property in a KS's graph. We propose to compute how specific and how general a property is to a given class based on a set of semantically related resources derived from a KS's graph. Taking into account the notations introduced in Subsection 4.1, given the semantic meta-graph of an entity e (i.e. G(e)), we derive the relative importance of a property p ∈ G(e) to a given class c ∈ G(e) in a KS graph G_KS by first defining the specificity of p to c as follows:

$$ specificity_{KS}(p, c) = \frac{N_p(R(c))}{N(R(c))} \qquad (2) $$

where N_p(R(c)) is the number of times property p appears in all resources of type c in the KS graph G_KS, and N(R(c)) is the number of resources of type c in G_KS. This measure captures the probability of the property p being assigned to an entity resource of type c. For example, if we consider the Iran entity semantic meta-graph and its dbpediaOwl:ground property and yago:MiddleEasternCountries class, then the specificity value of dbpediaOwl:ground in the DBpedia graph G_DB is computed as:

specificity_DB(dbpediaOwl:ground, yago:MiddleEasternCountries) = |{r ∈ R(yago:MiddleEasternCountries) : (r, dbpediaOwl:ground, o) ∈ G_DB}| / |R(yago:MiddleEasternCountries)|

As indicated in Equation 2, the computation of the specificity value is independent of the entity e and differs according to the KS graph from which it is derived. (It is worth mentioning that for each entity resource the specificity values for the properties are the same, capturing in this way the generalisation of the property for the same concept type.) Higher specificity values indicate that the property p occurs frequently on resources of the given class c.

Conversely, the generality measure captures the specialisation of a property p to a given class c, by computing the property's frequency among the other semantically related classes R'(c). We define the generality measure of a property p to a class c in a KS graph G_KS as follows:

$$ generality_{KS}(p, c) = \frac{N(R'(c))}{N_p(R'(c))} \qquad (3) $$

where N(R'(c)) is the number of resources whose type is either c or a specialisation of c's parent classes. This measure captures the relative generalisation of a property p to a broader set of specialised sibling classes derived from c, and its computation is independent of the entity e. In this case, the generality of the property dbpediaOwl:ground given the class yago:MiddleEasternCountries for the DB graph is computed as:

generality_DB(dbpediaOwl:ground, yago:MiddleEasternCountries) = |R'(yago:MiddleEasternCountries)| / |{r ∈ R'(yago:MiddleEasternCountries) : (r, dbpediaOwl:ground, o) ∈ G_DB}|

where R'(yago:MiddleEasternCountries) contains the resources whose rdf:type shares a parent class (via rdfs:subClassOf) with yago:MiddleEasternCountries. Higher generality values indicate that a property spans over multiple classes, and is less specific to a given class c. We combine these two measures (specificity and generality) of a property p to a given class c as follows:

SG(p, c) = specificity(p, c) × generality(p, c)
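Under these definitions, the W-SG weight of a (property, class) pair reduces to four counts over the KS graph. A sketch under that reading, with the count-collection step omitted and all names ours:

```python
def w_sg(n_p_rc, n_rc, n_rc_sib, n_p_rc_sib):
    """Class-property co-occurrence weight SG(p, c) = specificity * generality.

    n_p_rc:      N_p(R(c))  - resources of type c carrying property p
    n_rc:        N(R(c))    - resources of type c
    n_rc_sib:    N(R'(c))   - resources whose type is c or a specialisation of
                              c's parent classes
    n_p_rc_sib:  N_p(R'(c)) - resources in R'(c) carrying property p
    """
    specificity = n_p_rc / n_rc          # Equation 2
    generality = n_rc_sib / n_p_rc_sib   # Equation 3
    return specificity * generality
```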

4.3 Enhancing a TC's Feature Space with Semantic Features

We investigated two different strategies for incorporating semantic features into the feature space used for training SVM topic classifiers:

4.3.1 Semantic augmentation

This strategy augments the traditional lexical features (e.g. BoW and BoE features) with additional semantic information extracted for the entities appearing in a document. Considering the C (i.e. semantic class) feature set introduced in Subsection 4.1, we extend the feature set F into F'_{A1-C} by adding the class features extracted from the aggregation of the semantic meta-graphs of those entities appearing in the document x. The expanded feature vocabulary size is therefore |F'_{A1-C}| = |F| + |F_c|, where |F_c| denotes the total number of unique class features. For the P (i.e. semantic class-property) feature set we extend the feature set F into F'_{A1-P} by adding the property features extracted from the aggregation of the semantic meta-graphs of the entities appearing in the document x. The expanded feature vocabulary size is therefore |F'_{A1-P}| = |F| + |F_p|, where |F_p| denotes the total number of unique property features. For the combined C+P feature set this augmentation strategy creates the novel feature set F'_{A1-C+P}, in which the feature set F is expanded with the <p, c> tuple features derived from the semantic meta-graphs. In this case, the size of the expanded feature set is |F'_{A1-C+P}| = |F| + |F_p| × |F_c|.
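A minimal sketch of this augmentation step over sparse feature dictionaries; the prefixes and helper name are illustrative, not taken from the paper:

```python
def augment_features(lexical_feats, entity_meta_graphs,
                     use_classes=True, use_props=True):
    """Extend a document's lexical feature dict (BoW/BoE) with semantic features.

    lexical_feats: dict mapping lexical features to weights (e.g. TF-IDF).
    entity_meta_graphs: list of (classes, properties) pairs, one per entity
    mentioned in the document.
    """
    feats = dict(lexical_feats)
    for classes, props in entity_meta_graphs:
        if use_classes:
            for c in classes:                       # F'_A1-C: class features
                feats["class:" + c] = feats.get("class:" + c, 0) + 1
        if use_props:
            for p in props:                         # F'_A1-P: property features
                feats["prop:" + p] = feats.get("prop:" + p, 0) + 1
        if use_classes and use_props:
            for c in classes:                       # F'_A1-C+P: <p, c> tuples
                for p in props:
                    key = "pc:" + p + "|" + c
                    feats[key] = feats.get(key, 0) + 1
    return feats
```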

4.3.2 Semantic augmentation with generalisation

This augmentation strategy aims to further improve the generalisation of a TC by exploiting the subsumption relation among classes within the DBpedia or Freebase ontologies. In this strategy, instead of using the typeOf class c of an entity, we consider more generic classes of c, namely the set of parent classes of c, parent(c). In this case the feature set F is enhanced with the set of parent classes of c, where c ∈ C. The size of the enhanced feature set F'_{A2-C} is therefore |F'_{A2-C}| = |F| + |F_parent(c)|, where |F_parent(c)| denotes the total number of unique parent classes of c. Similarly, the enhanced feature set F'_{A2-C+P}, which uses the C+P features, is built by adding the <p, parent(c)> tuple features. The size of F'_{A2-C+P} is therefore |F'_{A2-C+P}| = |F| + |F_p| × |F_parent(c)|, where |F_parent(c)| denotes the total number of unique parent(c) classes derived from a G_KS.

After generating the expanded feature sets, the last stage (as indicated in Figure 3) consists of training a classifier on the enhanced feature space. Our approach was evaluated on the Violence Detection and Emergency Response domains; the following section introduces the datasets on which this framework was tested.
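A sketch of this final training stage with a linear SVM over the augmented feature dictionaries. The paper only states that SVM topic classifiers are trained on the enhanced feature space; scikit-learn and the one-vs-rest multilabel setup are our assumptions for illustration.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.preprocessing import MultiLabelBinarizer
from sklearn.svm import LinearSVC

def train_topic_classifier(doc_features, doc_topics):
    """doc_features: list of feature dicts (lexical + semantic) per document.
    doc_topics: list of topic-label sets per document (multilabel)."""
    vec = DictVectorizer(sparse=True)
    X = vec.fit_transform(doc_features)
    mlb = MultiLabelBinarizer()
    Y = mlb.fit_transform(doc_topics)
    clf = OneVsRestClassifier(LinearSVC()).fit(X, Y)
    return vec, mlb, clf
```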

5. DATASET

To analyse the impact of utilising semantic features in TC of Tweets, we evaluated the performance of the proposed strategies using a large corpus of Tweets and two large-coverage linked KSs, namely DBpedia and Freebase. Since we evaluated our framework in the context of VD and ER, we used a sample dataset relevant to these domains. The Twitter dataset (TW) was derived from Abel et al.'s dataset [1], comprising Tweets collected over a period of two months starting in November 2010. This dataset has been topically annotated and has been used in previous work on the TC of Tweets task [16]. The topical annotations of these Tweets include the following topic labels: "War & Conflict" (War), "Law & Crime" (Cri) and "Disaster & Accident" (DisAcc). We manually re-annotated this collection of Tweets, ensuring 1,000 Tweets for each of these topic labels; these Tweets served as positive examples for each topic. In order to mimic the imbalance issue posed by the detection of Tweets in this domain, in which a large proportion of microposts in a stream might be irrelevant to the topic of interest, we built a negative dataset comprising a large collection of Tweets which do not bear any relation to these three topics (i.e. War, DisAcc and Cri). The final Twitter dataset is a highly multilabel dataset comprising 10,189 Tweets, with Tweets annotated with up to six topic labels. Some notable events related to violence and ER discussed within these datasets include, among others, the "Mexican drug war", "Egyptian revolution", "Iranian Stoning Sentence", and "Indonesia Volcano Eruption".

We built the DBpedia and Freebase topic datasets by issuing SPARQL queries against these endpoints for all resources belonging to categories and subcategories of the skos:Concepts of War, DisAcc and Cri respectively, keeping each resource's abstract or title as a document labelled with the given topic. Following this process, the final DBpedia dataset comprises 9,465 articles, and the Freebase dataset consists of 16,915 articles [16].
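As an illustration of this collection step, a sketch of a query gathering DBpedia resources filed under the War category (and its subcategories) together with their abstracts. The exact category URIs, traversal depth and endpoint behaviour used in the paper are not specified, so the query below is only one plausible reading.

```python
from SPARQLWrapper import SPARQLWrapper, JSON

# Assumed query: resources whose subject category is Category:War or any
# category reachable from it via skos:broader, keeping the English abstract.
QUERY = """
PREFIX dcterms: <http://purl.org/dc/terms/>
PREFIX skos:    <http://www.w3.org/2004/02/skos/core#>
PREFIX dbo:     <http://dbpedia.org/ontology/>
SELECT DISTINCT ?r ?abstract WHERE {
  ?r dcterms:subject ?cat .
  ?cat skos:broader* <http://dbpedia.org/resource/Category:War> .
  ?r dbo:abstract ?abstract .
  FILTER (lang(?abstract) = "en")
}
LIMIT 1000
"""

sparql = SPARQLWrapper("http://dbpedia.org/sparql")
sparql.setQuery(QUERY)
sparql.setReturnFormat(JSON)
articles = [(b["r"]["value"], b["abstract"]["value"])
            for b in sparql.query().convert()["results"]["bindings"]]
```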

5.1 Datasets Pre-processing

Here we summarise the pre-processing steps performed before the application of the TC classifiers.

Lexical features: In order to obtain the BoW features, for each document (i.e. KS-derived article or Tweet) we removed stopwords, converted all words into lower case and applied the Lovins stemmer. In addition, we removed all Twitter-specific hashtags, mentions and URLs, to reduce the vocabulary differences between the KS and TW datasets. The feature spaces were also reduced to the top 1,000 words weighted by TF-IDF for each topic. Next, following the steps introduced in Section 4, we performed the data enrichment on the KS and TW datasets to derive the BoE features, using the OpenCalais and Zemanta NER services.
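A sketch of this lexical pre-processing; NLTK's Porter stemmer is used here as a stand-in for the Lovins stemmer named above, and the regular expressions for Twitter markup are ours:

```python
import re
from nltk.corpus import stopwords      # requires the NLTK 'stopwords' data
from nltk.stem import PorterStemmer    # stand-in for the Lovins stemmer

STOP = set(stopwords.words("english"))
STEM = PorterStemmer()

def preprocess(text):
    """Lower-case, strip Twitter-specific tokens, drop stopwords, stem."""
    text = re.sub(r"(https?://\S+)|(@\w+)|(#\w+)", " ", text.lower())
    tokens = re.findall(r"[a-z]+", text)
    return [STEM.stem(t) for t in tokens if t not in STOP]
```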

Obtaining semantic features: Using the BoE features derived from a document, we issued SPARQL queries for each entity's resource in DBpedia and Freebase. From these resources we built the semantic meta-graphs for each KS graph (i.e. G_DB and G_FB) as indicated in Subsection 4.1. In addition, from both the DBpedia and Freebase KS graphs we disregarded some properties containing general information about an entity (i.e. common to every instance), e.g. rdfs:comment, abstract and wikiPageExternalLink from DBpedia and type/object from Freebase. These feature spaces were also reduced by considering, for each entity type defined by OpenCalais (e.g. Person), the top 5 entity classes and top 5 properties derived from the different KS graphs. The statistics of the lexical and semantic features derived for these datasets are summarised in Table 1. The DBpedia dataset contains the highest number of entities for each topic, on average 22.24 entities per article, while the number of documents without any entity is 69 (0.72%). In the case of Freebase, the average number of entities per article is 8.14, and the percentage of articles without any entity is 19.96% (3,377 articles). Lastly, the Twitter dataset consists of informative Tweets mentioning at least one entity, the average number of entities per Tweet being 1.73.

6. BASELINE FEATURES

We compared the performance of the evaluated topic classifiers based on the proposed semantic feature augmentation strategies against several baseline models corresponding to state-of-the-art approaches for TC.

Bag-of-Unigram (BoW) features: The unigram features capture our natural intuition to utilise what we know about a particular topic, so that the features which are most indicative of a topic can be detected and the appropriate label(s) assigned. Models trained on unigram features have been shown to perform well on cross-source TC, outperforming on average the BoE features presented below [16]. The BoW features consist of a collection of words weighted by TF-IDF (term frequency-inverse document frequency), capturing the relative importance of a word in a document relative to its use in the whole corpus.

Bag-of-Entity (BoE) features: These features make use of entities and concepts extracted using available annotation services, e.g. the OpenCalais API, weighted by TF-IDF (e.g. f_BoE(BarackObama ∧ Person)). These web services annotate each entity with generic types; for example, in the case of "Barack Obama", rather than recognising it as being of type dbpedia:President, the majority of these services will annotate this entity with the label Person ([14]).

Part-of-Speech (POS) features: These features leverage the occurrences of syntactic patterns in a document, and the relevance of these patterns for characterising a topic. In this work we used Ritter et al.'s Twitter NLP Tools [13], whose POS tagger has been trained for short text messages. Similarly to the BoW and BoE features, we weighted each POS tag using TF-IDF.
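For completeness, a sketch of how such TF-IDF-weighted baselines can be assembled; NLTK's general-purpose tagger stands in for Ritter et al.'s Twitter-specific tagger, and the two example texts are placeholders:

```python
from nltk import pos_tag, word_tokenize           # stand-in POS tagger
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["nkorea prepared nuclear weapons holy war",            # placeholder
        "volcano eruption forces thousands to evacuate"]       # documents

# BoW baseline: TF-IDF over word unigrams
bow_features = TfidfVectorizer().fit_transform(docs)

# POS baseline: TF-IDF over the documents' POS tag sequences
pos_docs = [" ".join(tag for _, tag in pos_tag(word_tokenize(d))) for d in docs]
pos_features = TfidfVectorizer().fit_transform(pos_docs)
```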

7. EXPERIMENTAL SETUP

We performed a series of experiments to investigate the impact of the use of semantic features and linked KSs for the task of TC of Tweets in the context of VD and ER.

                      |          DB           |          FB           |          TW
Statistics            | DisAcc    Cri    War  | DisAcc    Cri    War  | DisAcc    Cri    War
Lex       BoW         |  8,837  8,837  8,504  |  2,078  4,596  2,574  |  3,218  3,197  2,781
          BoE         | 18,247 18,247 18,167  |  1,172  2,715  1,822  |  1,818  1,816  2,146
Semantic  dbClass     |    119    119    124  |     39     47     48  |     80     85     68
          yagoClass   |  3,865  3,865  3,864  |    351    834    922  |  1,480  1,795  1,275
          dbprop      |  4,105  4,105  4,215  |  1,229  1,849  1,871  |  2,544  2,457  2,422
          cls/ent     |   4.56   4.56   4.48  |   5.55   4.21   6.33  |   5.73   6.02   5.80
          prop/ent    |  26.56  26.56  26.29  |  39.65  33.97  41.78  |  36.99  32.62  36.17
          fbClass     |  1,289  1,289  1,215  |    394    713    641  |    881    915    772
          fbprop      |  1,090  1,090  1,065  |    420    586    554  |    834    869    696
          fbcls/ent   |   7.30   7.30   7.12  |  15.89  12.68  15.57  |  11.98  11.66  12.49
          fbprop/ent  |  10.08  10.08   9.76  |  23.44  17.06  23.05  |  16.93  16.65  17.97

Table 1: Statistics of the DB, FB and TW datasets used in the context of VD and ER. BoW and BoE give the vocabulary sizes of the BoW and BoE features. dbClass, yagoClass and fbClass stand for the number of unique classes extracted from the DB, FB and DB-FB knowledge graphs; dbprop counts the number of unique dbpedia properties and, correspondingly, fbprop counts the number of unique freebase properties. cls/ent refers to the average number of dbpedia and yago classes per entity, while fbcls/ent denotes the average number of freebase classes per entity. Similarly, prop/ent denotes the average number of dbpedia and yago properties per entity, and fbprop/ent refers to the average number of freebase properties per entity. After concept generalisation, the number of unique dbClass classes reduces by 76%, the number of unique yagoClass classes by 92%, and the number of unique fbClass classes by 88%.

                            |    TW(db+yago+fb)    |     TW(db+yago)      |        TW(fb)
Dataset  Features           |   P      R     F1    |   P      R     F1    |   P      R     F1
War      BOW                | 0.867  0.743  0.800  | 0.867  0.743  0.800  | 0.867  0.743  0.800
         POS                | 0.844  0.757  0.798  | 0.844  0.757  0.798  | 0.844  0.757  0.798
         BOE                | 0.857  0.761  0.806  | 0.857  0.761  0.806  | 0.857  0.761  0.806
         C(Freq)            | 0.864  0.727  0.790  | 0.867  0.736  0.796  | 0.873  0.744  0.803
         parent(C)(Freq)    | 0.859  0.734  0.792  | 0.862  0.730  0.791  | 0.874  0.743  0.803
         P(Freq)            | 0.874  0.743  0.803  | 0.872  0.739  0.800  | 0.869  0.742  0.800
         C+P(SG)            | 0.869  0.746  0.803  | 0.880  0.748  0.808  | 0.868  0.749  0.804
         parent(C)+P(SG)    | 0.871  0.745  0.803  | 0.868  0.745  0.802  | 0.873  0.754  0.809
         P(SG)              | 0.885  0.777  0.828  | 0.885  0.759  0.817  | 0.881  0.759  0.816
Cri      BOW                | 0.715  0.521  0.602  | 0.715  0.521  0.602  | 0.715  0.521  0.602
         POS                | 0.667  0.541  0.597  | 0.667  0.541  0.597  | 0.667  0.541  0.597
         BOE                | 0.736  0.534  0.619  | 0.736  0.534  0.619  | 0.736  0.534  0.619
         C(Freq)            | 0.705  0.518  0.597  | 0.714  0.516  0.599  | 0.715  0.525  0.605
         parent(C)(Freq)    | 0.716  0.523  0.604  | 0.723  0.518  0.603  | 0.724  0.523  0.607
         P(Freq)            | 0.711  0.525  0.604  | 0.712  0.524  0.604  | 0.718  0.524  0.606
         C+P(SG)            | 0.709  0.521  0.601  | 0.712  0.517  0.599  | 0.717  0.522  0.604
         parent(C)+P(SG)    | 0.716  0.522  0.604  | 0.709  0.521  0.601  | 0.716  0.526  0.607
         P(SG)              | 0.729  0.547  0.625  | 0.716  0.534  0.612  | 0.731  0.532  0.616
DisAcc   BOW                | 0.800  0.637  0.709  | 0.800  0.637  0.709  | 0.800  0.637  0.709
         POS                | 0.746  0.652  0.696  | 0.746  0.652  0.696  | 0.746  0.652  0.696
         BOE                | 0.798  0.670  0.728  | 0.798  0.670  0.728  | 0.798  0.670  0.728
         C(Freq)            | 0.790  0.636  0.705  | 0.800  0.632  0.707  | 0.792  0.631  0.703
         parent(C)(Freq)    | 0.793  0.634  0.705  | 0.799  0.632  0.706  | 0.795  0.635  0.706
         P(Freq)            | 0.779  0.620  0.690  | 0.793  0.636  0.706  | 0.797  0.628  0.703
         C+P(SG)            | 0.799  0.629  0.704  | 0.804  0.635  0.709  | 0.797  0.630  0.704
         parent(C)+P(SG)    | 0.810  0.629  0.708  | 0.804  0.636  0.710  | 0.797  0.637  0.708
         P(SG)              | 0.808  0.656  0.724  | 0.811  0.644  0.718  | 0.800  0.646  0.715

Table 2: Performance of the TW SVM TC classifiers using lexical (BOW, POS, BOE) and semantic features derived from the DB (TW(db+yago)), FB (TW(fb)) and DB-FB (TW(db+yago+fb)) KS graphs. The values highlighted in bold correspond to the best results obtained for the lexical and semantic features in terms of F1 measure for each scenario.

In the first set of experiments, we compared the performance of the cross-source SVM TCs derived for the proposed cross-source scenarios (i.e. DB, FB, DB-FB) and an SVM TC built on Tweets only (TW), using the baseline lexical features introduced in Section 6 and the proposed semantic features and weighting strategies introduced in Subsections 4.1 and 4.2 respectively. The main research questions that we aim to address are: Do semantic meta-graphs built from KSs contain useful semantic features about entities for the TC of Tweets? To what extent do these semantic features help the VD and ER TC tasks?

In the second set of experiments, we investigated the benefit of using linked KSs for TC of Tweets. For this reason, we compared the performance of the cross-source TCs using KS data alone and combined. In addition, we also considered a joint scenario leveraging data from both KS and Twitter data (which we refer to as KS + TW). In these experiments we address the research question: Which KS data and KS taxonomy (i.e. DBpedia and Yago or Freebase) provide more useful information for TC of Tweets?

For evaluating the performance of the TC classifiers, we trained the TW TC classifier on 80% of the Twitter data, the cross-source TC classifiers for the KS scenario on the full KS data only, and the cross-source TC classifiers for the KS + TW scenario on the full KS data combined with 80% of the Twitter data. We evaluated each of these TC classifiers on 20% of the Twitter data over five independent runs.

Furthermore, in order to provide an insight into the factors contributing to the performance of a TC, we examined the correlation between the topic-class, entity-class and topic-property entropy values and the accuracy of the TC classifiers. We computed the three metrics for each topic as follows:

1. Topic-Class bag entropy (Class-Entropy): We took the class bag for each topic derived from the KS graphs and measured the entropy of that class bag, capturing the dispersion of classes used for a particular topic. In this context, low entropy indicates a focused topic, while high entropy indicates an unfocused topic which is more random in the subjects that it discusses. We define this measure as H_T(C) = -\sum_{j=1}^{|C_T|} p(c_j) \log p(c_j), where p(c_j) denotes the conditional probability of a concept c_j within the topic's concept bag C_T.

2. Entity-Class entropy (Entity-Entropy): We computed this measure for each topic by considering the entity bags for each class mentioned in a topic, based on the extracted KS graphs. This measure captures the dispersion of the entities in each class. That is, low entropy indicates that the topic is less ambiguous, consisting of entities belonging to few classes, while high entropy refers to higher ambiguity at the level of entities: H_{C_T}(E) = -\sum_{j=1}^{|E_{C_T}|} p(e_j) \log p(e_j), where p(e_j) denotes the conditional probability of an entity e_j within the class's entity bag of the topic, E_{C_T}.

3. Topic-Class-Property entropy (Property-Entropy): We measured this by taking the property bag for each class appearing in each topic derived from the KS graphs, capturing the dispersion of properties used in a topic. In this context, low entropy indicates that a topic is dominated by few class-properties, while high entropy reveals high property diversity. The corresponding measure is defined as H_T(P) = -\sum_{j=1}^{|P_T|} p(p_j) \log p(p_j), where p(p_j) denotes the conditional probability of a property p_j within the topic's property bag P_T.
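A sketch of the three entropy measures, which are all standard Shannon entropies over different bags; the bag-construction code that feeds them from the KS graphs is omitted and the names are ours:

```python
import math
from collections import Counter

def bag_entropy(bag):
    """Shannon entropy of a bag (list with repeats) of items; used for the
    Class-Entropy, Entity-Entropy and Property-Entropy measures."""
    counts = Counter(bag)
    total = sum(counts.values())
    return -sum((n / total) * math.log(n / total) for n in counts.values())

# e.g. H_T(C)     = bag_entropy(topic_class_bag)     - classes observed for topic T
#      H_{C_T}(E) = bag_entropy(class_entity_bag)    - entities per class in topic T
#      H_T(P)     = bag_entropy(topic_property_bag)  - properties observed for topic T
```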

7.1 Comparison of different feature sets for TC

Table 2 summarises the results obtained for the TW classifier using both lexical (i.e. BoW, POS and BoE) and semantic features constructed from the individual (DB, FB) KSs and the combined (DB-FB) KSs. We can observe that, among the collection of features, the best results were obtained for the combined (TW(db+yago+fb)) scenario using the P features with the W-SG weighting strategy, which significantly outperforms the baseline lexical features (t-test with α < 0.05). For example, for the War category the F1 measure increases by 2.8% with respect to the BOW features and by 2.2% with respect to the BOE features; for the Cri category the F1 measure increases by 2.3% with respect to the BOW features and by 0.6% with respect to the BOE features; while in the case of DisAcc an improvement of 1.5% over the BOW features can be observed. Our novel class-property co-occurrence weighting scheme (W-SG) for the properties (P(SG)) also shows a significant improvement over the feature frequency strategy (P(Freq)) (t-test with α < 0.01).

This indicates that encoding the importance of a property for a given concept within the KS improves the generality of the properties and the performance of the TC classifier for each topic. When comparing the results obtained for the different taxonomies, we observe that the semantic features derived from the db+yago ontologies provide a significant improvement over the semantic features derived from the fb ontology for the War and DisAcc topics, but not for Cri (t-test with α < 0.05). An explanation for this could be that in the Cri topic the entities extracted from the DB KS graph are more ambiguous than those found within the War and DisAcc topics (see the cls/ent values in Table 1). In a similar manner, the entities extracted from the FB KS are more ambiguous in the Cri topic than in the other two topics (see the fbcls/ent values in Table 1). For each topic, however, the best overall results were obtained by the combined db+yago+fb ontology (obtained by augmenting the db classes with the mapping between yago and fb classes) together with the db and fb properties, indicating that the three ontologies contain complementary information (e.g. properties) about the entities.

Inspecting the augmentation strategies for the C and C+P feature sets, we noticed that for the fb ontology the augmentation strategy using generalisation (parent(C)(Freq) and parent(C)+P(SG)) showed a consistent improvement over the non-generalised case (C(Freq) and C+P(SG)) for each topic; for the db+yago ontology, however, encoding the very specific classes of the entities was found to be more beneficial for some topics (e.g. War). This is understandable because, after generalisation, entities which have the same parent class in the KS graphs are unified to the same semantic concept type, losing in this way the very specific meaning of the entity. In the case of the yago ontology, the number of unique classes reduces by 92% after generalisation, while in fb the number of unique classes becomes 88% smaller (see Table 1).

Comparing these results with the ones obtained for the cross-source scenarios in Table 3, we can notice that the best overall results were also obtained by the P features with W-SG weighting, which significantly improved over the baseline features for all three topics (t-test with α < 0.05). These results thus further validate the benefit of incorporating semantic features from KS graphs in both TW TC and cross-source TC.

To provide an insight into the factors contributing to the performance of the TC classifiers, we computed the Pearson correlation values between the three Class-, Entity- and Property-Entropy metrics (presented in Section 7) and the performance of the TC classifiers. Looking at these values in Figure 4, we find that the Entity-Entropy provides the highest correlation (over 61% in absolute terms for two topics) with the accuracy achieved. These results indicate that, as the number of ambiguous entities in a topic increases (i.e. the entity-entropy increases), the performance of the TC classifiers decreases.

In conclusion, considering the results obtained for the various semantic features derived from the three KS graphs, our findings are as follows:

1. Semantic meta-graphs built from KSs contain useful semantic features about entities for the TC of Tweets. In particular, incorporating semantic features about properties (P) using our novel class-property co-occurrence weighting scheme (W-SG) brought a significant improvement over previous state-of-the-art approaches.

                            |       DB+FB       |     DB+FB+TW      |        DB         |       DB+TW       |        FB         |       FB+TW
Dataset  Features           |   P     R     F1  |   P     R     F1  |   P     R     F1  |   P     R     F1  |   P     R     F1  |   P     R     F1
War      BOW                | 0.420 0.011 0.022 | 0.955 0.861 0.905 | 0.208 0.049 0.080 | 0.877 0.723 0.793 | 0.678 0.136 0.226 | 0.851 0.722 0.781
         POS                | 0.217 0.006 0.013 | 0.952 0.880 0.914 | 0.258 0.034 0.061 | 0.859 0.744 0.797 | 0.597 0.148 0.237 | 0.809 0.746 0.776
         BOE                | 0.903 0.007 0.014 | 0.842 0.761 0.799 | 0.490 0.040 0.080 | 0.856 0.754 0.802 | 0.767 0.120 0.207 | 0.753 0.801 0.776
         C(Freq)            | 0.370 0.009 0.017 | 0.957 0.878 0.916 | 0.221 0.045 0.075 | 0.881 0.720 0.792 | 0.678 0.136 0.226 | 0.844 0.699 0.765
         parent(C)(Freq)    | 0.426 0.011 0.022 | 0.957 0.880 0.917 | 0.206 0.047 0.077 | 0.877 0.727 0.795 | 0.678 0.136 0.226 | 0.846 0.718 0.776
         P(Freq)            | 0.364 0.009 0.017 | 0.956 0.871 0.911 | 0.222 0.054 0.086 | 0.876 0.717 0.789 | 0.683 0.136 0.227 | 0.845 0.712 0.773
         C+P(SG)            | 0.422 0.011 0.021 | 0.956 0.864 0.908 | 0.195 0.043 0.071 | 0.878 0.726 0.795 | 0.673 0.136 0.226 | 0.844 0.723 0.779
         parent(C)+P(SG)    | 0.406 0.011 0.021 | 0.955 0.863 0.907 | 0.244 0.040 0.069 | 0.874 0.716 0.787 | 0.683 0.136 0.227 | 0.844 0.715 0.774
         P(SG)              | 0.902 0.006 0.013 | 0.967 0.879 0.921 | 0.303 0.062 0.103 | 0.874 0.731 0.796 | 0.670 0.136 0.226 | 0.850 0.732 0.787
Cri      BOW                | 0.489 0.013 0.025 | 0.944 0.857 0.898 | 0.071 0.006 0.011 | 0.718 0.477 0.573 | 0.747 0.143 0.240 | 0.723 0.489 0.583
         POS                | 0.448 0.013 0.025 | 0.950 0.860 0.902 | 0.069 0.005 0.009 | 0.676 0.527 0.592 | 0.695 0.150 0.247 | 0.667 0.517 0.582
         BOE                | 0.353 0.028 0.052 | 0.814 0.626 0.708 | 0.049 0.004 0.008 | 0.744 0.502 0.600 | 0.656 0.108 0.186 | 0.733 0.498 0.593
         C(Freq)            | 0.616 0.011 0.021 | 0.944 0.873 0.907 | 0.083 0.009 0.016 | 0.702 0.471 0.564 | 0.691 0.140 0.233 | 0.722 0.486 0.581
         parent(C)(Freq)    | 0.586 0.010 0.019 | 0.944 0.873 0.907 | 0.082 0.008 0.014 | 0.705 0.477 0.569 | 0.740 0.141 0.237 | 0.728 0.489 0.585
         P(Freq)            | 0.628 0.011 0.021 | 0.944 0.866 0.904 | 0.096 0.011 0.019 | 0.705 0.473 0.566 | 0.710 0.145 0.241 | 0.724 0.484 0.580
         C+P(SG)            | 0.663 0.013 0.026 | 0.945 0.858 0.899 | 0.062 0.006 0.011 | 0.703 0.464 0.559 | 0.726 0.140 0.235 | 0.728 0.490 0.586
         parent(C)+P(SG)    | 0.617 0.012 0.024 | 0.945 0.858 0.899 | 0.067 0.006 0.011 | 0.706 0.469 0.563 | 0.738 0.143 0.240 | 0.716 0.490 0.582
         P(SG)              | 0.666 0.014 0.028 | 0.944 0.864 0.903 | 0.126 0.009 0.016 | 0.699 0.468 0.560 | 0.713 0.138 0.232 | 0.739 0.496 0.593
DisAcc   BOW                | 0.216 0.002 0.004 | 0.955 0.869 0.910 | 0.584 0.059 0.107 | 0.782 0.608 0.684 | 0.835 0.090 0.162 | 0.819 0.605 0.696
         POS                | 0.322 0.009 0.017 | 0.951 0.860 0.903 | 0.273 0.029 0.052 | 0.746 0.630 0.688 | 0.719 0.090 0.159 | 0.744 0.625 0.679
         BOE                | 0.875 0.040 0.076 | 0.810 0.629 0.708 | 0.494 0.043 0.079 | 0.806 0.653 0.722 | 0.909 0.048 0.092 | 0.744 0.648 0.692
         C(Freq)            | 0.293 0.002 0.004 | 0.951 0.881 0.915 | 0.553 0.070 0.125 | 0.783 0.599 0.679 | 0.835 0.090 0.162 | 0.805 0.605 0.691
         parent(C)(Freq)    | 0.267 0.002 0.004 | 0.953 0.883 0.917 | 0.568 0.060 0.109 | 0.789 0.611 0.689 | 0.835 0.090 0.162 | 0.814 0.601 0.692
         P(Freq)            | 0.238 0.002 0.004 | 0.953 0.871 0.910 | 0.519 0.070 0.123 | 0.777 0.600 0.677 | 0.835 0.090 0.162 | 0.805 0.591 0.681
         C+P(SG)            | 0.237 0.002 0.004 | 0.953 0.866 0.907 | 0.570 0.067 0.120 | 0.786 0.606 0.684 | 0.835 0.090 0.162 | 0.812 0.598 0.689
         parent(C)+P(SG)    | 0.268 0.002 0.005 | 0.953 0.866 0.908 | 0.578 0.062 0.112 | 0.785 0.615 0.689 | 0.835 0.090 0.162 | 0.816 0.602 0.693
         P(SG)              | 0.248 0.002 0.004 | 0.954 0.873 0.912 | 0.643 0.059 0.109 | 0.800 0.603 0.688 | 0.835 0.090 0.162 | 0.815 0.607 0.695

Table 3: Results obtained for the DB, FB and DB-FB cross-source SVM TCs in terms of precision (P), recall (R) and F1 measure. The numbers highlighted in bold show the best results obtained for the semantic features and lexical features for each scenario.

Figure 4: Pearson correlation coefficient values between the Topic-Class bag entropy (Class-Entropy), Entity-Class entropy (Entity-Entropy), Topic-Property entropy (Property-Entropy) and the performance of the cross-source TC classifiers. The reported correlations are:

          Class-Entropy  Entity-Entropy  Property-Entropy
  War         -0.37          -0.66            -0.26
  DisAcc       0.17          -0.22             0.19
  Crime        0.34          -0.61             0.26

2. The level of ambiguity of the entities found in the topics, measured by the Entity-Class entropy, was shown to have an impact on the performance of the TC classifier. That is, topics which contain a smaller number of highly ambiguous entities achieved a higher accuracy than those with a large number of ambiguous entities.

7.2 Comparing multiple KSs with single KS for TC

As shown in Table 3, the best performances across all the features, feature weighting strategies and augmentation strategies were achieved by the proposed DB + FB + TW TC classifier, which successfully harnessed multiple linked KSs and significantly improved over the baselines using a single KS by 11.9-30.7% (over DB + TW) and 13.4-31.4% (over FB + TW) (t-test with α < 0.05). In addition, a significant improvement of 9.3%-28.2% can be observed over the TW(db+yago+fb) classifier built on Twitter data only (t-test with α < 0.05). Furthermore, the superiority of the TW TC classifier over the DB, FB and DB + FB TC classifiers is in line with our previous findings, which demonstrated that outperforming the TW TC classifiers is extremely difficult using KS data alone ([16]).

The biggest improvement of the linked KS (DB + FB + TW) classifier was achieved for the Cri topic (31.4% over FB + TW), for which the FB + TW single KS classifier using BOW features performed better than the DB + TW single KS classifier; however, there was a relatively high number (3,377) of articles without any entity. In this case the combined mapped db+yago+fb ontology could increase the coverage of the classes found in the merged DB + FB data. The enrichment strategies that consistently improved over the baselines are W-SG for the P features, semantic augmentation by feature frequency (C(Freq)) and semantic augmentation by generalisation (parent(C)(Freq)) (Table 3).

The large proportion of uncovered entities in Freebase is due to its lack of coverage and specialisation. For example, of the total number of entities extracted by OpenCalais, a large proportion (40%) were not found in the Freebase KS. Considering the DBpedia KS, 35% of the entities were not assigned any URI, while in the Twitter dataset 36% of the entities did not have any URI. Nevertheless, the improvement in F1 measure after harnessing both KSs suggests that they complement each other well: Freebase brings its strength in content coverage for the topics, while DBpedia brings useful semantic evidence about the entities which are covered. In conclusion, considering these results, our findings are as follows:

1. Linked KSs combined with Twitter data (DB + FB + TW) provide complementary information for TC of Tweets at both the lexical and semantic levels, significantly outperforming the single KS approaches (DB + TW and FB + TW) and the approach using Tweets only.

2. The performance of a linked-KS-based TC depends on both the usefulness of the datasets collected from the individual KSs and the coverage of the entities within those KSs. When entities have low coverage in a KS, exploiting the mapping between the corresponding KS ontologies is beneficial.

8. DISCUSSION AND FUTURE DIRECTIONS

The main advantage of our approach is that it exploits the semantic information present in multiple linked KSs (DBpedia and Freebase) about the entities contained in KS articles and Tweets. The incompleteness of, and inconsistencies within, the KSs have some drawbacks for the performance of our approach. We observed that there are entity types which were mapped to a very generic class rather than to a more informative specific one. For example, in Freebase the /crime/crime_accuser class is derived from the very generic /common/topic class, while another related class type, /crime/convicted_criminal, extends the /people/person class. This mismatch can affect the generalisation of the patterns learned by our TC, in that entities which should be considered together might belong to different entity types. One possible solution to overcome this problem is to perform a cross-consistency validation, by investigating the overlapping properties between the entities assigned to the same entity classes and considering the most likely entity classes ([4]).

The frequent usage of ungrammatical English words in Tweets also poses challenges for our TC of Tweets. Due to the restricted size of short messages, entities such as country names (nkorea) are often abbreviated, as in the following Tweet: "nkorea prepared nuclear weapons holy war south official tells state media usa". These irregularities mean that current annotation services (including the OpenCalais API) will ignore such entities, and therefore no semantic information will be exploited for them within this framework. A possible solution to address these challenges is to apply lexical normalisers developed especially for Tweets ([7]) to normalise these words to standard English terms.

Incorporating additional information about entities in terms of their properties (e.g. "birthPlace", "presidentOf") and their roles (e.g. "President", "Author", "Spouse") allows a more fine-grained representation of them. This information can be beneficial to TC, as certain entity types are more likely to be indicative of a topic than others. In this work we focused on contextualising entities by leveraging the semantic information present in the taxonomies of linked KSs. A natural extension of our work could be to automatically extract and contextualise the properties inside the articles and Tweets, which could further allow unifying the feature types used and thus reduce data sparsity. For example, properties such as inhabits and resident, or vote and elect, could be considered synonymous, and providing a semantic type for them could further improve the generalisation of our approach [19].

9. CONCLUSION

This paper concerns the issue of detecting the topics of a stream of Tweets in social media. The approach adopted exploits the information present in linked KSs (such as DBpedia and Freebase): it links the entities found in KS articles and user-generated content to a URI defining their intended meaning, and it creates a semantic graph for each concept using the link structure of the linked KSs, enabling content to be represented with higher-level semantic information and allowing a better generalisation between KSs and Twitter. Due to the nature of social media, the content contains various ambiguous concepts, making the recognition of topics a challenging task, mostly because of the limited contextual information available to disambiguate them. However, using the information present in linked KSs and the links between entities, it is possible to exploit the various roles of the entities together with the specific and more generic properties describing them. This paper shows the importance of considering the information present in linked KSs in the context of Violence Detection and Emergency Response, achieving a significant improvement over various state-of-the-art approaches using single KSs without linked data and approaches utilising Tweets only. Our future work will focus on validating our approach on other datasets (e.g. blogs, forums), considering other sub-topics too.

10. REFERENCES

[1] F. Abel, Q. Gao, G.-J. Houben, and K. Tao. Analyzing user modeling on Twitter for personalized news recommendations. In Proceedings of the 19th International Conference on User Modeling, Adaption, and Personalization, UMAP'11, pages 1-12, Berlin, Heidelberg, 2011. Springer-Verlag.
[2] C. Bizer, J. Lehmann, G. Kobilarov, S. Auer, C. Becker, R. Cyganiak, and S. Hellmann. DBpedia - a crystallization point for the Web of Data. J. Web Sem., 7(3):154-165, 2009.
[3] K. D. Bollacker, C. Evans, P. Paritosh, T. Sturge, and J. Taylor. Freebase: a collaboratively created graph database for structuring human knowledge. In SIGMOD Conference, pages 1247-1250, 2008.
[4] J. Dolby, A. Fokoue, A. Kalyanpur, E. Schonberg, and K. Srinivas. Extracting enterprise vocabularies using Linked Open Data. In 8th International Semantic Web Conference (ISWC 2009), Oct. 2009.
[5] A. Garcia-Silva, O. Corcho, and J. Gracia. Associating semantics to multilingual tags in folksonomies, 2010.
[6] Y. Genc, Y. Sakamoto, and J. V. Nickerson. Discovering context: classifying tweets through a semantic transform based on Wikipedia. In Proceedings of the 6th International Conference on Foundations of Augmented Cognition, FAC'11, pages 484-492, Berlin, Heidelberg, 2011. Springer-Verlag.
[7] B. Han and T. Baldwin. Lexical normalisation of short text messages: makn sens a #twitter. In Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies - Volume 1, HLT'11, pages 368-378, Stroudsburg, PA, USA, 2011. Association for Computational Linguistics.
[8] J. Hoffart, F. M. Suchanek, K. Berberich, and G. Weikum. YAGO2: a spatially and temporally enhanced knowledge base from Wikipedia. Artificial Intelligence Journal, Special Issue on Artificial Intelligence, Wikipedia and Semi-Structured Resources, 2012.
[9] M. Michelson and S. A. Macskassy. Discovering users' topics of interest on Twitter: a first look. In Proceedings of the Fourth Workshop on Analytics for Noisy Unstructured Text Data, AND'10, New York, NY, USA, 2010.
[10] D. Milne and I. H. Witten. Learning to link with Wikipedia. 2008.
[11] O. Muñoz García, A. García-Silva, O. Corcho, M. de la Higuera Hernández, and C. Navarro. Identifying topics in social media posts using DBpedia. In Proceedings of the NEM Summit, pages 81-86. NEM Initiative, Sept. 2011.
[12] X.-H. Phan, L.-M. Nguyen, and S. Horiguchi. Learning to classify short and sparse text and web with hidden topics from large-scale data collections. In Proceedings of the 17th International Conference on World Wide Web, 2008.
[13] A. Ritter, S. Clark, Mausam, and O. Etzioni. Named entity recognition in tweets: an experimental study. In Proceedings of the 2011 Conference on Empirical Methods in Natural Language Processing, pages 1524-1534, Edinburgh, Scotland, UK, July 2011. Association for Computational Linguistics.
[14] G. Rizzo and R. Troncy. NERD: a framework for evaluating named entity recognition tools in the web of data. 2011.
[15] Y. Song, H. Wang, Z. Wang, H. Li, and W. Chen. Short text conceptualization using a probabilistic knowledgebase. In IJCAI, pages 2330-2336, 2011.
[16] A. Varga, A. E. Cano, and F. Ciravegna. Exploring the similarity between social knowledge sources and Twitter for cross-domain topic classification. In Proceedings of the Knowledge Extraction and Consolidation from Social Media workshop, 11th International Semantic Web Conference (ISWC 2012), 2012.
[17] D. Vitale, P. Ferragina, and U. Scaiella. Classification of short texts by deploying topical annotations. In ECIR, pages 376-387, 2012.
[18] T. Xu and D. W. Oard. Wikipedia-based topic clustering for microblogs. Proc. Am. Soc. Info. Sci. Tech., 48(1):1-10, 2011.
[19] Z. Zhang, A. L. Gentile, and F. Ciravegna. Harnessing different knowledge sources to measure semantic relatedness under a uniform model. In EMNLP, pages 991-1002, 2011.
