Department of Computer Science
Named Entities in the Digital Humanities
This presentation: http://j.mp/nerdh
Eetu Mäkelä (http://www.seco.tkk.fi/u/jiemakel/)
CKCC - An example of a DH project utilizing NER
Recogito - An example of a Named Entity reconciliation tool
Particularities of NER in the Digital Humanities
● Humanities materials are complex: a single document may contain multiple languages, and the language may be old or change over time within a corpus, …
Particularities of NER in the Digital Humanities
● Humanities scholars are:
  ● extremely thorough in verifying information
  ● used to huge amounts of manual work
→ NER is a part of a much larger process
→ Assume someone is going to manually go through your NER results
→ Recall much more important than precision
→ Important to discover named entity occurrences, but not e.g. derive entity types
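The recall-over-precision point can be made concrete with an F-beta score, where beta > 1 weights recall more heavily. This is a minimal sketch: the function name `f_beta` and all counts are illustrative assumptions, not figures from the presentation.

```python
def f_beta(tp: int, fp: int, fn: int, beta: float = 2.0) -> float:
    """F-beta score from raw counts; beta > 1 favours recall over precision."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Hypothetical taggers evaluated against the same 100 true mentions.
cautious = dict(tp=70, fp=5, fn=30)   # few false alarms, many misses
generous = dict(tp=95, fp=60, fn=5)   # many false alarms, few misses

f1_cautious = f_beta(**cautious, beta=1.0)
f1_generous = f_beta(**generous, beta=1.0)
f2_cautious = f_beta(**cautious, beta=2.0)
f2_generous = f_beta(**generous, beta=2.0)
```

With these made-up counts the balanced F1 prefers the cautious tagger, while F2 prefers the generous one. That matches a workflow in which a scholar reviews every hit anyway, so a missed mention costs more than a false alarm.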
Particularities of NER in the Digital Humanities
● It is important to go beyond locating named entity surface forms to strongly identify the individuals behind them
→ Coreference resolution, use of databases of identities and name variants
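One simple way to use a database of identities and name variants is an index from normalized variants to identifiers. The sketch below assumes a tiny in-memory authority table; the identifier `person:arlincourt` and the helper names `normalize` and `identify` are hypothetical, and a real authority file would need far more care.

```python
import unicodedata
from typing import Optional


def normalize(name: str) -> str:
    """Fold case, drop diacritics, and collapse punctuation and whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(c for c in decomposed if not unicodedata.combining(c))
    cleaned = "".join(c if c.isalnum() else " " for c in no_marks.casefold())
    return " ".join(cleaned.split())


# Hypothetical authority records: identifier -> known name variants.
AUTHORITY = {
    "person:arlincourt": [
        "Charles-Victor Prévost d'Arlincourt",
        "Charles Victor Prevot d' Arlincourt",
    ],
}

# Reverse index: normalized variant -> identifier.
INDEX = {
    normalize(variant): pid
    for pid, variants in AUTHORITY.items()
    for variant in variants
}


def identify(surface_form: str) -> Optional[str]:
    """Return the identifier for a surface form, or None if it is unknown."""
    return INDEX.get(normalize(surface_form))
```

An exact lookup on normalized strings only absorbs differences in case, accents, hyphens and spacing. A bare surname such as "Arlincourt" stays unresolved and would go to coreference resolution or manual review.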
Further examples of Research Questions in the Digital Humanities
● Ancient Name Dropping:
  ● Co-citation graph of mythical and real authorities in ancient Greek scientific texts
● Contextual reader:
  ● First World War primary sources
  ● Ancient texts (2)
  ● Finnish law
● Corpus of Early English Correspondence:
  ● How much do highly educated people use the word happiness compared with people with less education?
● Bibliothèque nationale de France:
  ● Which places publish a disproportionately large amount of philosophy in French in the 18th century?
Data sources for named entity information
Virtual International Authority File
● http://viaf.org/viaf/98930150/
● Joins together authority files of 45 national libraries and other institutions
● “Anyone who has ever published anything that is in any of the catalogues of the participating libraries”
● People and organizations
● 2014/02: 50 million names for 19 million entities
● 2015/05: 274 million names for 79 million entities
● Some birth/death date information
Problems for NER
● Automatic conversion from “Lastname, Firstname” to “Firstname Lastname” does not always work due to bad data
Charles-Victor Prévost d'Arlincourt Charles Victor Prévôt ˜d'œ Arlincourt Charles Victor Prevot d' Arlincourt Arlincourt
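The conversion problem can be sketched as follows. This is an illustrative guess at a naive inverter, not the method used in the presentation; the function name `invert_name` and the input headings are assumptions. It refuses anything that does not look like exactly "Lastname, Firstname" so that such headings can be checked by hand.

```python
from typing import Optional


def invert_name(heading: str) -> Optional[str]:
    """Turn 'Lastname, Firstname' into 'Firstname Lastname'.

    Returns None unless the heading has exactly one comma with text on
    both sides, so the caller can route the heading to manual review.
    """
    parts = [p.strip() for p in heading.split(",")]
    if len(parts) != 2 or not all(parts):
        return None
    last, first = parts
    return f"{first} {last}"


# Hypothetical headings, modelled on the variants listed above.
inverted = invert_name("Arlincourt, Charles Victor Prevot d'")
rejected = invert_name("Arlincourt")
```

Even on a well-formed heading, the naive join yields "Charles Victor Prevot d' Arlincourt", with a stray space after the particle. Variants like this end up in the data and have to be matched later.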
Problems for NER
● Different forms of encoding, typos
(Paris,) (Paris) (Paris.)
Paris A Paris [A Paris]
[Paris,] À Paris
[Paris] (Paris
Amsterdam. - et Paris Amsterdam ; et Paris Amsterdam. - et à Paris Amsterdam [Paris] (Paris. - Amsterdam A Amsterdam [i. e. Paris]. M. DCC. LXX.
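A minimal sketch of how the single-place variants above could be folded into one form. The function name `normalize_place` and the exact set of stripped characters are assumptions. Multi-place strings such as "Amsterdam ; et Paris" and corrections such as "[i. e. Paris]" are deliberately not handled here.

```python
import re

# Leading French preposition as printed on title pages ("A Paris", "À Paris").
_PREPOSITION = re.compile(r"^(?:a|à)\s+", re.IGNORECASE)


def normalize_place(raw: str) -> str:
    """Strip surrounding brackets and punctuation, then a leading 'A'/'À'."""
    text = raw.strip().strip("[]().,;:- ")
    return _PREPOSITION.sub("", text)


# Single-place variants taken from the examples above.
VARIANTS = [
    "(Paris,)", "(Paris)", "(Paris.)", "Paris", "A Paris",
    "[A Paris]", "[Paris,]", "À Paris", "[Paris]", "(Paris",
]
normalized = {normalize_place(v) for v in VARIANTS}
```

The preposition pattern requires whitespace after the letter, so "Amsterdam" is left intact while "A Amsterdam" is reduced to "Amsterdam".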
Getty Union List of Artists’ Names
● http://www.getty.edu/vow/ULANFullDisplay?find=rumi&role=&nation=&prev_page=1&subjectid=500337998
● Names, birth/death dates, education, occupation, relationships
● 2011: 600 000 names for 200 000 people
Consortium of European Research Libraries Thesaurus
● Place names and personal names in Europe in the period of hand press printing (1450 - c. 1830)
● http://thesaurus.cerl.org/cgi-bin/record.pl?rid=cnp01317268
● 20,000 place names, 900,000 names for people
● Names, biographical dates, activities, publications
Wikidata
● https://www.wikidata.org/wiki/Q43347
● Structured information on 14 million Wikipedia entities
DBpedia
● http://dbpedia.org/page/Rumi
● Structured information extracted from Wikipedia infoboxes
Publication information sources
● DNB: Deutsche Nationalbibliografie
● BNF: Bibliographie nationale française
● BNB: British National Bibliography
● EEBO: Early English Books Online (1475-1700)
● ECCO: Eighteenth Century Collections Online
● OCLC WorldCat: 305 million books from OCLC member libraries
Structured data sources for places
● Getty Thesaurus of Geographic Names - 2 million names for 1.4 million modern and historical places
  ● http://www.getty.edu/vow/TGNFullDisplay?find=rome&place=&nation=&prev_page=1&english=Y&subjectid=7000874
● GeoNames - 10 million names for 9 million places
● Pleiades - 35,000 ancient places
● National gazetteers
  ● Historical Gazetteer of England’s Place-Names
  ● PNR
● DBpedia, Wikidata
● Place names in other datasets (BNF, BNB, ...)
Structured data sources for other entities
● Getty Cultural Objects Name Authority (CONA)
● Gallery and museum databases, e.g. British Museum, Finnish National Gallery, Europeana, Digital Public Library of America
● Wikidata, DBpedia
● Domain-specific vocabularies such as WW1LOD