Interfacing structured and unstructured data in sociolinguistic research on language change
Terttu Nevalainen, Samuli Kaislaniemi, Anni Sairio, Tanja Säily, Anna Merikallio, Taru Nordlund, Katja Litola, Johanna Utriainen, Eetu Mäkelä, Poika Isokoski, Harri Siirtola

Linguistic end research questions
• Social meaning of spelling variation in historical periods of English and Finnish
• Social variation in language productivity in early English correspondence

Subprojects – the long road from data to questions
1. From letters to data (Finnish)
2. Quality control of data (English)
3. Tools for linguistic research (English ⇒ Finnish)
4. Linguistic research

From letters to data
Katja Litola, Johanna Utriainen

From handwritten letters to structured data

Corpus building
Digital letter corpus of Early Modern Finnish:
● Socially, temporally and regionally balanced corpus of letters from the long 19th century (1800-1921).
● Handwritten letters unearthed from public and private archives around Finland.
● Transcribing, checking and re-checking about one thousand handwritten letters.

Critical points:
● Working with original manuscript material is extremely laborious and time-consuming; negotiations with memory organizations; legal restrictions on publishing online; protection of identity.

Quality control of data
Samuli Kaislaniemi, Anni Sairio

Linguistic variation: give vs giue, up vs vp
[Figure: old vs new spelling, =?=]

What the editor says: “letters in the present edition are published … precisely as written”

Reality: checking editions vs manuscripts

Example: spelling variation in 17th-century letters
All editions → Assess editions → ‘Best’ editions

Tools for linguistic research
Harri Siirtola, Eetu Mäkelä

Tools for linguistic research
• Starting from two tools: TVE & Khepri
• Moving to develop tools in dialogue with and as part of end-user linguistic research

Linguistic research (and tools for such)
Tanja Säily, Eetu Mäkelä, Jukka Suomela

Case study: derivational productivity of -er and -or
● Verb + suffixes -er and -or: driver, governor, filler
● Corpora of Early English Correspondence: spelling variation, false positives
  ○ er(e), ar(e), or(e), our(e), owr(e), ur(e), r + plural, possessive…
  ○ \S*(([rR]|[eEoO]~)(=?|=?[eE]=?|[='~]*[eEiIyY]?[='~]*[sSzZ][=']*))(?![a-zA-Z'~=+])
  ○ 6800 candidate words, 400 000 appearances
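
The candidate search above can be sketched in Python. This is a minimal illustration, not the project's actual tooling: it uses the pattern from the slide with the final group taken to be a negative lookahead `(?!…)`, which is an assumption about the extracted text. The helper name `find_candidates`, the punctuation stripping, and the sample sentence are invented for the example; the symbols `=`, `~` and `'` in the pattern are presumably corpus transcription markup.

```python
import re

# Pattern from the slide; the last group is read as a negative lookahead
# (assumption: the space in "( ?!" is an extraction artifact).
CANDIDATE = re.compile(
    r"\S*(([rR]|[eEoO]~)(=?|=?[eE]=?|[='~]*[eEiIyY]?[='~]*[sSzZ][=']*))"
    r"(?![a-zA-Z'~=+])"
)

def find_candidates(text):
    """Return whitespace-separated tokens that the pattern matches in full.

    Hypothetical helper: surrounding punctuation is stripped first, but
    apostrophes are kept because the pattern handles possessives itself.
    """
    hits = []
    for raw in text.split():
        token = raw.strip(".,;:!?()")
        if token and CANDIDATE.fullmatch(token):
            hits.append(token)
    return hits

# Invented sample sentence mixing modern and u/v-variant spellings.
sample = "The driuer and the governor gaue letters to her father."
print(find_candidates(sample))
```

The pattern only checks how a word ends, so words such as "letters", "her" and "father" are caught alongside real derivations like "driuer" and "governor". This is the false-positive problem the slide mentions, and it is why the candidates need manual review.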

FiCa

Derivational productivity of -er and -or
● 5080 words out of 6800 irrelevant after manual study
● 153 words out of 6800 needed further study
  ○ 11768 individual uses
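
The proportions behind these counts can be worked out in a few lines. The numbers come from the slide; the variable names are illustrative, and attributing the 11768 uses to the 153 words needing further study is an assumption based on how the bullet is nested.

```python
# Counts as reported on the slide.
candidates = 6800
irrelevant = 5080
further_study = 153
uses_further_study = 11768  # assumed to belong to the 153 words above

# Shares of the candidate list, in percent.
share_irrelevant = 100 * irrelevant / candidates
share_further = 100 * further_study / candidates

# Mean number of uses per word that needed further study.
uses_per_word = uses_further_study / further_study

print(f"irrelevant: {share_irrelevant:.1f}%")
print(f"further study: {share_further:.2f}%")
print(f"mean uses per word needing further study: {uses_per_word:.1f}")
```

Roughly three quarters of the pattern-matched candidates were discarded by hand. The slide does not state the status of the remaining words.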

Case study: newly coined words
• Compare Corpus of Early English Correspondence words to:
  • The millions of words in Eighteenth Century Collections Online, Early English Books Online, British Library Newspapers, Burney Collection, Nichols Collection
  • Structured information in the Oxford English Dictionary
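
One way to picture this comparison is as a lookup of first-occurrence dates. The sketch below is a hypothetical illustration, not the project's method: the function name, the dictionaries, the placeholder words and the years are all invented, and a real comparison would also have to handle spelling variation.

```python
# Toy data invented for illustration only: word -> year of first occurrence.
letters_first_seen = {"word_a": 1620, "word_b": 1650, "word_c": 1480}
reference_first_seen = {"word_a": 1590, "word_b": 1660}

def new_coinage_candidates(corpus, reference):
    """Return words missing from the reference or attested earlier in the corpus.

    Hypothetical helper; maps each word to (corpus year, reference year or None).
    """
    result = {}
    for word, year in corpus.items():
        ref_year = reference.get(word)
        if ref_year is None or year < ref_year:
            result[word] = (year, ref_year)
    return result

print(new_coinage_candidates(letters_first_seen, reference_first_seen))
```

Words flagged this way are only candidates for newly coined words. Each one would still need checking against the source texts and dictionary entries.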

http://j.mp/s-makela
This presentation: http://j.mp/stratas-l
