Department of Computer Science

Named Entities in the Digital Humanities
This presentation: http://j.mp/nerdh
Eetu Mäkelä (http://www.seco.tkk.fi/u/jiemakel/)

CKCC - An example of a DH project utilizing NER

Recogito - An example of a Named Entity reconciliation tool

Particularities of NER in the Digital Humanities
● Humanities materials are complex: a single document may contain multiple languages, language may be old or change through time in a corpus, …

Particularities of NER in the Digital Humanities
● Humanities scholars are:
  ● extremely thorough in verifying information
  ● used to huge amounts of manual work
→ NER is a part of a much larger process
→ Assume someone is going to manually go through your NER results
→ Recall is much more important than precision
→ Important to discover named entity occurrences, but not e.g. to derive entity types
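One way to make "recall much more important than precision" concrete is to score NER output with an F-beta measure where beta > 1. The sketch below is only an illustration: the span sets are hypothetical, and beta = 2 is an assumed weighting, not a value given in the presentation.

```python
def precision_recall_fbeta(gold, predicted, beta=2.0):
    """Compare predicted entity mentions against a gold set.

    beta > 1 weights recall more heavily than precision.
    """
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta ** 2
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Hypothetical mentions, each given as a (start_offset, end_offset) span.
gold = {(0, 5), (10, 18), (30, 36), (50, 58)}

# A cautious tagger: finds only half of the mentions, all of them correct.
cautious = {(0, 5), (10, 18)}

# A greedy tagger: finds every mention, plus as many false alarms.
greedy = gold | {(60, 64), (70, 75), (80, 83), (90, 99)}

for name, spans in [("cautious", cautious), ("greedy", greedy)]:
    p, r, f1 = precision_recall_fbeta(gold, spans, beta=1.0)
    _, _, f2 = precision_recall_fbeta(gold, spans, beta=2.0)
    print(f"{name}: P={p:.2f} R={r:.2f} F1={f1:.2f} F2={f2:.2f}")
```

Both taggers get the same F1, but F2 prefers the greedy one. That matches a workflow where a scholar reviews every hit anyway: a false alarm costs a moment of checking, while a missed mention is never seen at all.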

Particularities of NER in the Digital Humanities
● It is important to go beyond locating named entity surface forms to strongly identify the individuals behind them
→ Coreference resolution, use of databases of identities and name variants
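As a rough illustration of using a database of identities and name variants, the sketch below maps surface forms to identifiers through a folded lookup key. The `AUTHORITY` records, the `person:N` identifiers and the `fold` rules are all assumptions made for the example; a real project would load variants from an authority file and would still need disambiguation when one form maps to several identities.

```python
import unicodedata


def fold(name):
    """Casefold, strip diacritics, and collapse hyphens and whitespace."""
    decomposed = unicodedata.normalize("NFKD", name)
    no_marks = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return " ".join(no_marks.replace("-", " ").casefold().split())


# Hypothetical authority records: identifier -> known name variants.
AUTHORITY = {
    "person:1": [
        "Charles-Victor Prévost d'Arlincourt",
        "Charles Victor Prevot d'Arlincourt",
    ],
    "person:2": ["Example Person"],
}


def build_index(authority):
    """Map every folded variant to the set of identities that use it."""
    index = {}
    for identity, variants in authority.items():
        for variant in variants:
            index.setdefault(fold(variant), set()).add(identity)
    return index


INDEX = build_index(AUTHORITY)


def identify(surface_form):
    """Return candidate identities for a surface form (possibly empty)."""
    return INDEX.get(fold(surface_form), set())


print(identify("CHARLES VICTOR PREVOST D'ARLINCOURT"))
print(identify("Charles Victor Prévôt d'Arlincourt"))
print(identify("Somebody Else"))
```

Returning a set rather than a single identifier keeps ambiguous matches visible, so they can be handed to a human reviewer instead of being silently resolved.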

Further examples of Research Questions in the Digital Humanities
● Ancient Name Dropping:
  ● Co-citation graph of mythical and real authorities in ancient Greek scientific texts
● Contextual reader:
  ● First World War primary sources
  ● Ancient texts (2)
  ● Finnish law
● Corpus of Early English Correspondence:
  ● How much do highly educated people use the word happiness vs those of a lower education?
● Bibliothèque nationale de France:
  ● Which places publish disproportionately much philosophy in French in the 18th century?

Data sources for named entity information

Virtual International Authority File
● http://viaf.org/viaf/98930150/
● Joins together authority files of 45 national libraries and other institutions
● “Anyone who has ever published anything that is in any of the catalogues of the participating libraries”
● People and organizations
● 2014/02: 50 million names for 19 million entities
● 2015/05: 274 million names for 79 million entities
● Some birth/death date information

Problems for NER
● Automatic conversion from “Lastname, Firstname” to “Firstname Lastname” does not always work due to bad data

Charles-Victor Prévost d'Arlincourt
Charles Victor Prévôt ˜d'œ Arlincourt
Charles Victor Prevot d' Arlincourt
Arlincourt
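The conversion problem can be sketched with a deliberately naive inverter. The function name `invert_name` and the input headings are assumptions for illustration; they are not taken from any of the authority files. The point is only that a comma-split rule yields odd output, or none, as soon as the heading deviates from the clean two-part pattern.

```python
def invert_name(heading):
    """Naively turn 'Lastname, Firstname' into 'Firstname Lastname'.

    Returns None when the heading does not have exactly two non-empty
    comma-separated parts, so the record can be flagged for manual review.
    """
    parts = [part.strip() for part in heading.split(",")]
    if len(parts) != 2 or not all(parts):
        return None
    last, first = parts
    return f"{first} {last}"


# Hypothetical headings, chosen to show where the naive rule breaks down.
headings = [
    "Mäkelä, Eetu",                          # clean case
    "Arlincourt, Charles Victor Prévôt d'",  # particle ends up detached
    "Arlincourt",                            # no comma at all
    "Lastname, Firstname, extra field",      # more than two parts
]

for heading in headings:
    print(repr(heading), "->", repr(invert_name(heading)))
```

The second heading comes out as `Charles Victor Prévôt d' Arlincourt`, with a stray space after the particle, which is the kind of variant listed above. Returning `None` instead of guessing fits a workflow where uncertain cases go to a human.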

Problems for NER
● Different forms of encoding, typos
(Paris,) (Paris) (Paris.)

Paris A Paris [A Paris]

[Paris,] À Paris

[Paris] (Paris

Amsterdam. - et Paris Amsterdam ; et Paris Amsterdam. - et à Paris Amsterdam [Paris] (Paris. - Amsterdam A Amsterdam [i. e. Paris]. M. DCC. LXX.
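A first cleaning pass over such place strings might look like the sketch below. The rules (drop brackets and parentheses, trim punctuation, remove a leading "A"/"À") are assumptions derived only from the variants shown here, and `normalize_place` is a hypothetical helper, not part of any existing tool.

```python
import re


def normalize_place(raw):
    """Strip brackets, parentheses, edge punctuation and a leading 'A'/'À'."""
    text = re.sub(r"[\[\]()]", "", raw)   # remove cataloguers' brackets
    text = text.strip(" .,;")             # trim punctuation at the edges
    text = re.sub(r"^[AÀ]\s+", "", text)  # drop the preposition 'A'/'À'
    return text


variants = ["Paris", "(Paris,)", "(Paris.)", "[A Paris]", "[Paris,]",
            "À Paris", "(Paris", "A Amsterdam", "Amsterdam"]

for variant in variants:
    print(repr(variant), "->", repr(normalize_place(variant)))
```

Strings naming several places, such as "Amsterdam ; et Paris", or corrections such as "[i. e. Paris]", are intentionally left alone by this sketch; they need splitting or interpretation rules of their own, and probably manual review.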

Getty Union List of Artists’ Names
● http://www.getty.edu/vow/ULANFullDisplay?find=rumi&role=&nation=&prev_page=1&subjectid=500337998
● Names, birth/death dates, education, occupation, relationships
● 2011: 600 000 names for 200 000 people

Consortium of European Research Libraries Thesaurus
● Place names and personal names in Europe in the period of hand press printing (1450 - c. 1830)
● http://thesaurus.cerl.org/cgi-bin/record.pl?rid=cnp01317268
● 20,000 place names, 900,000 names for people
● Names, biographical dates, activities, publications

Wikidata
● https://www.wikidata.org/wiki/Q43347
● Structured information on 14 million Wikipedia entities

DBpedia
● http://dbpedia.org/page/Rumi
● Structured information extracted from Wikipedia infoboxes

Publication information sources
● DNB: Deutsche Nationalbibliografie
● BNF: Bibliographie nationale française
● BNB: British National Bibliography
● EEBO: Early English Books Online (1475-1700)
● ECCO: Eighteenth Century Collections Online
● OCLC WorldCat: 305 million books from OCLC member libraries

Structured data sources for places
● Getty Thesaurus of Geographic Names - 2 million names for 1.4 million modern and historical places
  ● http://www.getty.edu/vow/TGNFullDisplay?find=rome&place=&nation=&prev_page=1&english=Y&subjectid=7000874
● GeoNames - 10 million names for 9 million places
● Pleiades - 35,000 ancient places
● National gazetteers
  ● Historical Gazetteer of England’s Place-Names
  ● PNR
● DBpedia, Wikidata
● Place names in other datasets (BNF, BNB, ...)

Structured data sources for other entities
● Getty Cultural Objects Name Authority (CONA)
● Gallery and museum databases, e.g. British Museum, Finnish National Gallery, Europeana, Digital Public Library of America
● Wikidata, DBpedia
● Domain-specific vocabularies such as WW1LOD
