Introduction General intrinsic properties Text cleaning and processing Wordlist-based methods Syntactic functions Conclusions

Intrinsic Methods for Comparison of Corpora
Vít Baisa and Vít Suchomel
Natural Language Processing Centre, Faculty of Informatics, Masaryk University

December 6, 2013

A need for comparison of corpora

There are large text corpora from the web... but do we know what is inside?

Which corpus is generally better? Comparison based on inner properties of corpora.

Which corpus is better for a specific task? Comparison based on external use of corpus data.

Intrinsic vs. extrinsic comparison

The paper describes 8 intrinsic methods of corpus comparison divided into the following groups: general intrinsic properties, text cleaning and processing, wordlist-based methods and syntactic analysis.

Extrinsic methods will be explored in a future paper.

Data used in the experiment

The proposed methods were applied to compare two recent, very large web-based Czech text corpora:
Hector (Spoustová et al., 2010)
czTenTen12 (Suchomel, 2012)

Most of the presented methods are language-independent, but both corpora must be in the same language.

Size

A simple rule: the bigger, the better, because we need very large corpora to provide evidence about rare phenomena (Pomikálek et al., 2009).

Counts of words, tokens and sentences depend on the tokenization and sentence detection algorithms used to process the corpus data.

CORPUS      BYTES  TOKENS    WORDS     SENTENCES
Hector      17 GB  3.285 bn  2.607 bn  219 m
czTenTen12  31 GB  5.437 bn  4.458 bn  303 m
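To illustrate how the counts depend on the processing choices, here is a minimal sketch of a size counter. The regular-expression tokenizer, the definition of a "word" (a token with at least one alphanumeric character) and the sentence splitter are all assumptions made for the example; they are not the tools used to process Hector or czTenTen12.

```python
import re

# Assumed tokenization: runs of word characters, or single punctuation marks.
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def corpus_size(text):
    """Return (tokens, words, sentences) under a naive, assumed tokenization."""
    tokens = TOKEN_RE.findall(text)
    # Assumption: a "word" is a token containing at least one alphanumeric character.
    words = [t for t in tokens if any(ch.isalnum() for ch in t)]
    # Assumption: a sentence ends at '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return len(tokens), len(words), len(sentences)
```

Changing any of the three assumptions changes the reported sizes, which is why figures from differently processed corpora are only roughly comparable.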

Diversity of sources

The more diverse the sources of the data, the better the coverage of the language that may be expected from the corpus.

Hector: constructed from manually selected web sites with large and good-enough-quality textual content (e.g. news servers, blog sites, discussion fora).
czTenTen12: a general Czech web crawl.

Constraining the sources of a monolingual corpus to the corresponding national TLD is useful in the case of Czech.

CORPUS      PAGES      DOMAINS  AVG  MED  TLDS
czTenTen12  9,747,315  233,122  42   4    97.6 % cz
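The source statistics can be derived from the list of page URLs. The sketch below assumes that a "domain" is the host name of the URL and that AVG and MED are pages per domain; that reading is consistent with the figures above (9,747,315 / 233,122 ≈ 41.8), but it is an interpretation, and the function name is hypothetical.

```python
from collections import Counter
from statistics import mean, median
from urllib.parse import urlsplit


def source_diversity(urls):
    """Summarise the sources of a non-empty list of page URLs."""
    # Assumption: the host name stands for the "domain".
    hosts = [urlsplit(u).hostname or "" for u in urls]
    pages_per_domain = Counter(hosts)
    # The TLD is taken as the last dot-separated label of the host name.
    tlds = Counter(h.rsplit(".", 1)[-1] for h in hosts)
    counts = list(pages_per_domain.values())
    return {
        "pages": len(urls),
        "domains": len(pages_per_domain),
        "avg": mean(counts),    # average pages per domain
        "med": median(counts),  # median pages per domain
        "tld_share": {t: n / len(urls) for t, n in tlds.items()},
    }
```

A median far below the average, as in the table, indicates that a few large sites contribute most of the pages.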

Sentence length

Sentence border detection – different solutions observed.

[Figure: sentence length distribution for czTenTen12 and Hector; x-axis: words (0 to 250), y-axis: millions of sentences (logarithmic scale).]
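A sentence length distribution of this kind can be computed with a simple histogram. The sketch below assumes sentences are already split and that words are separated by whitespace; the bin width is an arbitrary choice for the example.

```python
from collections import Counter


def sentence_length_histogram(sentences, bin_width=10):
    """Map the start of each length bin (in words) to the number of sentences."""
    hist = Counter()
    for sentence in sentences:
        # Assumption: whitespace-separated items count as words.
        length = len(sentence.split())
        hist[(length // bin_width) * bin_width] += 1
    return dict(sorted(hist.items()))
```

Plotting the two corpora on a logarithmic count axis makes differences in the long tail visible, which is where divergent sentence border detection shows up.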

Data duplicity

The fewer duplicate texts in a corpus, the better. However, very strict deduplication needlessly removes usable data.

Hector: paragraphs containing more than 30% seen 8-grams were removed.
czTenTen12: paragraphs containing more than 50% seen 7-grams were removed.

onion was used to remove sentences consisting of 50% seen 5-grams (with smoothing disabled).

CORPUS      BYTES    TOKENS   SENTENCES
Hector      -23.3 %  -25.8 %  -23.6 %
czTenTen12  -17.6 %  -18.7 %  -18.4 %
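The idea of removing text units by their share of already seen n-grams can be sketched as follows. This is an illustration of the principle only, not onion's implementation: the function name, the strict "greater than" comparison and the choice to record n-grams only from kept units are assumptions.

```python
def dedup_units(units, n=5, threshold=0.5):
    """Keep token lists whose share of previously seen n-grams is at most threshold."""
    seen = set()
    kept = []
    for tokens in units:
        grams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
        # Units shorter than n tokens have no n-grams and are always kept.
        dup_share = sum(g in seen for g in grams) / len(grams) if grams else 0.0
        if dup_share > threshold:
            continue  # mostly duplicate content: drop the unit
        kept.append(tokens)
        # Assumption: only kept units contribute to the set of seen n-grams.
        seen.update(grams)
    return kept
```

With `n=8, threshold=0.3` or `n=7, threshold=0.5` the same function mimics the paragraph-level settings reported for Hector and czTenTen12, which shows how the two parameters trade recall of duplicates against loss of usable text.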

The test

The fewer paragraphs full of text in an unwanted language, the better. However, some level of foreign words cannot be avoided, e.g. in developers' blogs or in movie and music reviews.

[Figure: positions (log10 scale, 10 to 10M) of the case variants of "the" (the, The, THE, THe, tHe, ThE) in czTenTen12 and Hector.]
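The positions of the variants can be looked up in a frequency-ranked wordlist. The sketch below assumes that "position" means the rank in a case-sensitive wordlist sorted by descending frequency; that reading, and the function name, are assumptions.

```python
import math
from collections import Counter

THE_VARIANTS = ("the", "The", "THE", "THe", "tHe", "ThE")


def variant_positions(tokens, variants=THE_VARIANTS):
    """Return {variant: (rank, log10(rank))} for variants present in the tokens."""
    freq = Counter(tokens)
    # Rank 1 is the most frequent word; ties are broken alphabetically.
    ranked = sorted(freq, key=lambda w: (-freq[w], w))
    rank = {w: i for i, w in enumerate(ranked, start=1)}
    return {v: (rank[v], math.log10(rank[v])) for v in variants if v in rank}
```

In a Czech corpus, the lower the rank of "the", the more English text has slipped through language filtering.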

Filtering wordlists

Unknown words were filtered out of the corpus wordlists by a morphological analyser. The larger the remainder, the better. The fast Czech analyser Majka was used.

[Figure: share of the wordlist (% of wordlist, 0 to 100%) against the size of the wordlist (100 to 10M) for Hector and czTenTen12.]
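The measurement can be sketched as the share of the top-N wordlist entries that an analyser recognises. The `is_known` argument below is a hypothetical stand-in for a lookup in a morphological analyser; it does not reproduce Majka's interface, and the list sizes are arbitrary.

```python
from collections import Counter


def recognised_share(tokens, is_known, sizes=(100, 1000, 10000)):
    """Return {size: % of the top-`size` wordlist entries accepted by is_known}."""
    freq = Counter(tokens)
    # Wordlist sorted by descending frequency, ties broken alphabetically.
    wordlist = [w for w, _ in sorted(freq.items(), key=lambda kv: (-kv[1], kv[0]))]
    shares = {}
    for size in sizes:
        top = wordlist[:size]
        if top:
            shares[size] = 100.0 * sum(bool(is_known(w)) for w in top) / len(top)
    return shares
```

For example, `recognised_share(tokens, known_words.__contains__)` works with a plain set of known word forms.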

Keyword comparison

Following Kilgarriff's work, lowercase keywords were extracted to reveal the words in which these corpora differ the most. Both recent web corpora contain more data from internet message boards and fewer news documents than the Czech National Corpus.

Hector vs. SYN2000: taky, teda, ahoj, holky, mám, fakt, moc, sem, dneska, takže, blog, nevím, máš, super, ráda, ahojky (discussions among women).

czTenTen12 vs. SYN2000: taky, můžete, moc, děkuji, takže, cca, mám, dobrý, opravdu, dle, ahoj, bych, jestli, díky, hodně, super (discussions).

SYN2000 vs. Hector and czTenTen12: praha, včera, korun, procent, české, vlády, státní, miliónů, zákona, trhu, ministr, ředitel, výstava, společnost, nato, prezident, čtk (standard language, news, Prague).
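One keyword score associated with Kilgarriff is the "simple maths" ratio of smoothed per-million frequencies. Whether exactly this variant was used for the lists above is an assumption, as are the smoothing value and the function name in the sketch below.

```python
from collections import Counter


def keywords(focus_tokens, ref_tokens, smoothing=100.0, top=10):
    """Rank lowercase words of the focus corpus by a smoothed frequency ratio.

    Both token lists must be non-empty.
    """
    focus = Counter(t.lower() for t in focus_tokens)
    ref = Counter(t.lower() for t in ref_tokens)
    focus_total, ref_total = sum(focus.values()), sum(ref.values())

    def per_million(count, total):
        return 1e6 * count / total

    # score = (fpm_focus + smoothing) / (fpm_reference + smoothing)
    scores = {
        w: (per_million(c, focus_total) + smoothing)
        / (per_million(ref[w], ref_total) + smoothing)
        for w, c in focus.items()
    }
    return sorted(scores, key=lambda w: (-scores[w], w))[:top]
```

Swapping the two arguments gives the keywords of the reference corpus, as in the third list above. A larger smoothing value favours more frequent words.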

Syntactic functions

Syntactically correct sentences are good. The presence of the main syntactic roles, subject and predicate, was checked. This rules out web garbage (navigation and labels, tables, program code, SEO keywords, link spam, generated texts, ...) but also syntactically problematic yet otherwise quite common and understandable sentences. SET was used to carry out the experiment.

CORPUS      NCL     NSEN    PNSEN
Hector      36.6 %  19.0 %  23.7 %
czTenTen12  39.6 %  23.6 %  29.2 %

Future work

Explore other intrinsic methods: perplexity of language models, finding topics, measuring homogeneity and heterogeneity.

Develop extrinsic methods: word sketch evaluation (submitted to LREC 2014), morphological segmentation (Morfessor).

Conclusion

Eight methods for a general systematic comparison of text corpora were developed and described. The methods were applied to two recent very large Czech web corpora. The related tools can be downloaded from the website of the project. http://nlp.fi.muni.cz/projekty/corpora_comparison

Intrinsic Methods for Comparison of Corpora - raslan 2013
