Interfacing structured and unstructured data in sociolinguistic research on language change

Terttu Nevalainen, Samuli Kaislaniemi, Anni Sairio, Tanja Säily, Anna Merikallio, Taru Nordlund, Katja Litola, Johanna Utriainen, Eetu Mäkelä, Poika Isokoski, Harri Siirtola

Linguistic end research questions
• Social meaning of spelling variation in historical periods of English and Finnish
• Social variation in language productivity in early English correspondence

Subprojects – the long road from data to questions
1. From letters to data (Finnish)
2. Quality control of data (English)
3. Tools for linguistic research (English ⇒ Finnish)
4. Linguistic research

From letters to data Katja Litola, Johanna Utriainen

From handwritten letters to structured data

Corpus building
Digital letter corpus of Early Modern Finnish:
● Socially, temporally and regionally balanced corpus of letters from the long 19th century (1800-1921).
● Handwritten letters unearthed from public and private archives around Finland.
● Transcribing, checking and re-checking about one thousand handwritten letters.

Critical points:
● Working with original manuscript material is extremely laborious and time-consuming; negotiations with memory organizations; legal restrictions on online publication; protection of identity.

Quality control of data Samuli Kaislaniemi, Anni Sairio

Linguistic variation: give vs giue, up vs vp old



What the editor says: “letters in the present edition are published … precisely as written”

Reality: checking editions vs manuscripts

Example: spelling variation in 17th-century letters

All editions ⇒ Assess editions ⇒ ‘Best’ editions

Tools for linguistic research Harri Siirtola, Eetu Mäkelä

Tools for linguistic research
• Starting from two tools: TVE & Khepri
• Moving to develop tools in dialogue with and as part of end user linguistic research

Linguistic research (and tools for such) Tanja Säily, Eetu Mäkelä, Jukka Suomela

Case study: derivational productivity of -er and -or
● Verb + suffixes -er and -or: driver, governor, filler
● Corpora of Early English Correspondence: spelling variation, false positives
○ er(e), ar(e), or(e), our(e), owr(e), ur(e), r + plural, possessive…
○ \S*(([rR]|[eEoO]~)(=?|=?[eE]=?|[='~]*[eEiIyY]?[='~]*[sSzZ][=']*))(?![a-zA-Z'~=+])
○ 6800 candidate words, 400 000 appearances
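
A minimal sketch of how the candidate pattern above might be applied to running text. It assumes the final group is a negative lookahead, `(?!...)`, with the space in the extracted slide text treated as an artifact. The function name `find_candidates` and the sample sentence are hypothetical and are not taken from the project's own tools.

```python
import re

# Pattern from the slide; the final group is assumed to be a negative
# lookahead (the stray space in the slide text is treated as an artifact).
CANDIDATE = re.compile(
    r"\S*(([rR]|[eEoO]~)(=?|=?[eE]=?|[='~]*[eEiIyY]?[='~]*[sSzZ][=']*))"
    r"(?![a-zA-Z'~=+])"
)


def find_candidates(text):
    """Return the matched part of each whitespace-separated token."""
    hits = []
    for token in text.split():
        # match() anchors at the token start; the leading \S* absorbs the stem.
        m = CANDIDATE.match(token)
        if m:
            hits.append(m.group(0))
    return hits


# Hypothetical sample sentence, not corpus data.
sample = "the driver and the governor give vp letters to her"
print(find_candidates(sample))
# → ['driver', 'governor', 'letters', 'her']
```

The hits "letters" and "her" illustrate the false positives mentioned on the slide: the pattern casts a wide net over spellings, so the candidate list still needs manual study.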


Derivational productivity of -er and -or
● 5080 words out of 6800 irrelevant after manual study
● 153 words out of 6800 needed further study
○ 11768 individual uses

Case study: newly coined words
• Compare Corpus of Early English Correspondence words to:
  • The millions of words in Eighteenth Century Collections Online, Early English Books Online, British Library Newspapers, Burney Collection, Nichols Collection
  • Structured information in the Oxford English Dictionary
