Fast Construction of a Word↔Number Index for Large Data

Miloš Jakubíček, Pavel Rychlý, Pavel Šmerk
Natural Language Processing Centre
Faculty of Informatics
Masaryk University

7. 12. 2013

Šmerk et al. (NLPC FI MU)

Construction of a Word↔Number Index

7. 12. 2013

1/7

Introduction

• Inspiration: Aleš Horák @ 1st NLP Centre seminar :-)
  • (but we still have not compared Manatee with an SQL DB)
• Problem: indexes for large text corpora (billions of tokens)
• Current solution: .lex, .lex.idx and .lex.srt files
  • .lex: null-terminated strings, in the order of appearance in corpus
  • .lex.idx: 4B offsets of words in .lex
  • .lex.srt: 4B indices (positions in .lex.idx) sorted alphabetically
  • id2str: 2 memory accesses
  • str2id: 3 · log2 |lexicon| memory accesses
• New solution: HAT-trie + (reimplemented) Daciuk's fsa tools
  • HAT-trie: cache-conscious, combines trie + hash, allows sorted access
  • for indexing natural language strings, it is among the best solutions regarding both time and space
  • Daciuk: minimal DAFSA for perfect hashing
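The current .lex / .lex.idx / .lex.srt layout can be illustrated with a minimal in-memory sketch. It is not the Manatee or encodevert code: the function names, the little-endian byte order, UTF-8 encoding and byte-wise sort order are all assumptions made for illustration. It shows why id2str needs two lookups and why each binary-search step of str2id touches all three arrays.

```python
import struct

def word_at(lex, offset):
    """Read the null-terminated string starting at `offset` in .lex."""
    end = lex.index(b"\0", offset)
    return bytes(lex[offset:end])

def u32(buf, i):
    """Read the i-th 4-byte unsigned integer (little-endian is assumed)."""
    return struct.unpack_from("<I", buf, 4 * i)[0]

def build_lexicon(tokens):
    """Build byte images of .lex, .lex.idx and .lex.srt from a token stream."""
    lex = bytearray()   # null-terminated strings, in order of first appearance
    offsets = []        # offset of each word in lex; position in list == word id
    seen = set()        # Python set used only to detect repeats in this sketch
    for tok in tokens:
        if tok not in seen:
            seen.add(tok)
            offsets.append(len(lex))
            lex += tok.encode("utf-8") + b"\0"
    # word ids ordered by their strings (byte order stands in for "alphabetical")
    order = sorted(range(len(offsets)), key=lambda i: word_at(lex, offsets[i]))
    idx = struct.pack(f"<{len(offsets)}I", *offsets)
    srt = struct.pack(f"<{len(order)}I", *order)
    return bytes(lex), idx, srt

def id2str(lex, idx, word_id):
    """Two accesses: .lex.idx for the offset, then .lex for the string."""
    return word_at(lex, u32(idx, word_id)).decode("utf-8")

def str2id(lex, idx, srt, word):
    """Binary search over .lex.srt; every step reads .lex.srt, .lex.idx and .lex."""
    key = word.encode("utf-8")
    lo, hi = 0, len(srt) // 4
    while lo < hi:
        mid = (lo + hi) // 2
        word_id = u32(srt, mid)                    # access 1: .lex.srt
        candidate = word_at(lex, u32(idx, word_id))  # accesses 2 and 3
        if candidate == key:
            return word_id
        if candidate < key:
            lo = mid + 1
        else:
            hi = mid
    return None

lex, idx, srt = build_lexicon("to be or not to be".split())
print(id2str(lex, idx, 3), str2id(lex, idx, srt, "or"))
```

In a real deployment the three byte strings would be files mapped into memory; the access pattern is the same.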


Data sets used in the experiments

data set   size        words      unique     size (unique)   language
100M        1148 MB      110 M     1660 k      31 MB         Tajik
1000M       5161 MB      957 M     1366 k      14 MB         French
10000M     69010 MB    12967 M    27892 k     384 MB         English

• three sets of corpus data: they differ not only in size
• Tajik uses Cyrillic ⇒ words are two times longer only due to encoding
• French corpus (OPUS project): mostly legal texts ⇒ limited vocabulary
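The encoding effect for Tajik is easy to check, assuming the corpora are stored in UTF-8 (the slides do not name the encoding). Cyrillic letters take two bytes each in UTF-8, while unaccented Latin letters take one. The example words are chosen here for illustration only.

```python
def utf8_len(word):
    """Number of bytes the word occupies when encoded as UTF-8."""
    return len(word.encode("utf-8"))

tajik = "китоб"    # Tajik for "book", 5 Cyrillic letters
french = "livre"   # French for "book", 5 Latin letters

for word in (tajik, french):
    print(word, len(word), "characters,", utf8_len(word), "bytes")
```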


Comparison of encodevert and hat-trie

            encodevert            hat-trie
data set    time      memory      time      memory      size
100M        3:11 m    0.44 GB     26.5 s    0.06 GB     44 MB
1000M       23:01 m   0.40 GB     2:21 m    0.04 GB     25 MB
10000M      7:38 h    0.98 GB     44:37 m   0.78 GB     607 MB

            encodevert            hat-trie
data set    local     fair        fair
100M        3:27 m    1:25 m      32.6 s
1000M       26:10 m   6:26 m      3:09 m
10000M      9:21 h    4:02 h      1:02 h

• the table from the paper turned out to be unfair to encodevert
• local: data on a local hdd, which was, however, probably under heavier load
• fair times: both apps produce the same set of files
• in fact, this is still unfair, but now to hat-trie
• ⇒ whole applications have to be tested
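The size column (44 MB, 25 MB, 607 MB) appears consistent with the file layout from the Introduction: the strings themselves plus one 4-byte .lex.idx offset and one 4-byte .lex.srt index per unique word. The check below is only a sketch under assumptions: that the second size column of the data-set table is the .lex size, that 1 MB means 10^6 bytes, and that the rounded counts from the data-set table are precise enough.

```python
def index_files_mb(lex_mb, unique_words):
    """Estimated total size of .lex + .lex.idx + .lex.srt in MB (10**6 bytes)."""
    per_word_bytes = 4 + 4   # one .lex.idx offset + one .lex.srt index
    return lex_mb + per_word_bytes * unique_words / 1e6

# (data set, assumed .lex size in MB, unique words) taken from the data-set table
data_sets = [
    ("100M", 31, 1_660_000),
    ("1000M", 14, 1_366_000),
    ("10000M", 384, 27_892_000),
]

for name, lex_mb, unique in data_sets:
    print(name, round(index_files_mb(lex_mb, unique)), "MB")
```

Under these assumptions the estimates round to 44, 25 and 607 MB, matching the reported sizes.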


Reduction of the size of data

            encodevert            hat-trie
data set    time      memory      time      memory      size
100M        3:11 m    0.44 GB     26.5 s    0.06 GB     44 MB
1000M       23:01 m   0.40 GB     2:21 m    0.04 GB     25 MB
10000M      7:38 h    0.98 GB     44:37 m   0.78 GB     607 MB

            fsa_ubuild            hat + new fsa
data set    time      memory      time      memory      size
100M        failed                31.7 s    0.09 GB     15 MB
1000M       15:48 m   0.11 GB     2:34 m    0.06 GB     11 MB
10000M      7:44 h    31.01 GB    1:08 h    1.47 GB     363 MB

• for very large corpora the files can consume a lot of memory
• with Daciuk's fsa tools we have built automata for perfect hashing
• fsa_ubuild is Daciuk's original implementation (unsorted data)
• hat + new fsa is a reimplementation with HAT-trie as presort
• (experiments from the two tables were run on different hardware)
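Perfect hashing with an automaton maps each word to its rank among all accepted words and back, using per-state counts of how many words are accepted from that state. The sketch below illustrates only the numbering idea on a plain, unminimised trie; it is not Daciuk's fsa tools nor the reimplementation measured here, and all names are hypothetical. A minimal DAFSA would additionally share equivalent states to save space.

```python
class Node:
    def __init__(self):
        self.arcs = {}      # character -> Node
        self.final = False  # True if a word ends in this state
        self.count = 0      # number of words accepted starting from this state

def _count(node):
    node.count = int(node.final) + sum(_count(c) for c in node.arcs.values())
    return node.count

def build(words):
    root = Node()
    for word in words:
        node = root
        for ch in word:
            node = node.arcs.setdefault(ch, Node())
        node.final = True
    _count(root)
    return root

def word2num(root, word):
    """Rank of `word` among all stored words in sorted order, or None."""
    num, node = 0, root
    for ch in word:
        if ch not in node.arcs:
            return None
        if node.final:
            num += 1                       # a shorter word ends here and sorts first
        for other in sorted(node.arcs):
            if other >= ch:
                break
            num += node.arcs[other].count  # skip all words under smaller arcs
        node = node.arcs[ch]
    return num if node.final else None

def num2word(root, num):
    """Inverse of word2num."""
    if not 0 <= num < root.count:
        return None
    out, node = [], root
    while True:
        if node.final:
            if num == 0:
                return "".join(out)
            num -= 1
        for ch in sorted(node.arcs):
            child = node.arcs[ch]
            if num < child.count:
                out.append(ch)
                node = child
                break
            num -= child.count

root = build(["to", "be", "or", "not", "bee"])
print(word2num(root, "not"), num2word(root, 1))
```

Note that this numbering follows sorted order, whereas the .lex files number words by order of appearance in the corpus.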


HAT-trie based sort + fsa outperforms fsa_ubuild

            fsa_ubuild            hat + new fsa
data set    time      memory      time      memory      size
100M        failed                31.7 s    0.09 GB     15 MB
1000M       15:48 m   0.11 GB     2:34 m    0.06 GB     11 MB
10000M      7:44 h    31.01 GB    1:08 h    1.47 GB     363 MB

            hat-trie sort         fsa_build             new fsa
data set    time      memory      time      memory      time      memory
100M        28.4 s    0.06 GB     12.4 s    0.21 GB     4.2 s     0.03 GB
1000M       2:51 m    0.04 GB     5.6 s     0.11 GB     1.8 s     0.03 GB
10000M      59:16 m   0.77 GB     35:15 m   27.07 GB    9:36 m    0.71 GB

• the second table compares fsa construction from sorted data
• ⇒ with such an effective sort algorithm, sorting the data and then using the algorithm for sorted data is always better than fsa_ubuild
• ⇒ to reduce memory use, it would be better to flush the sorted data to hard disk before fsa construction, as the time penalty is minimal
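Construction from sorted data is cheap because only the path of the previously added word can still change; everything below the common prefix with the next word can be minimised immediately. The following is a compact sketch of the well-known incremental construction for sorted input, with a hash-based register of already minimised states. It is an illustration only, not the measured implementation, and all names are hypothetical.

```python
class State:
    def __init__(self):
        self.final = False
        self.arcs = {}   # character -> State; filled in sorted order

    def signature(self):
        # Two states are equivalent if they agree on finality and on
        # (label, target) pairs; targets are already canonical states.
        return (self.final, tuple((ch, id(t)) for ch, t in self.arcs.items()))

def _minimize(path, prev, down_to, register):
    """Replace states on the previous word's path below depth `down_to`
    by equivalent registered states, bottom-up."""
    for depth in range(len(path) - 1, down_to, -1):
        child = path[depth]
        canonical = register.setdefault(child.signature(), child)
        path[depth - 1].arcs[prev[depth - 1]] = canonical

def build_minimal_dafsa(sorted_words):
    root = State()
    register = {}    # signature -> canonical State (the "registered" nodes)
    path = [root]    # states along the previously added word
    prev = None
    for word in sorted_words:
        if prev is not None and word <= prev:
            raise ValueError("input must be sorted and free of duplicates")
        last = prev or ""
        k = 0        # length of the common prefix with the previous word
        while k < min(len(word), len(last)) and word[k] == last[k]:
            k += 1
        _minimize(path, last, k, register)
        del path[k + 1:]
        node = path[k]
        for ch in word[k:]:          # append the new suffix
            node.arcs[ch] = State()
            node = node.arcs[ch]
            path.append(node)
        node.final = True
        prev = word
    _minimize(path, prev or "", 0, register)
    return root

def accepts(root, word):
    node = root
    for ch in word:
        node = node.arcs.get(ch)
        if node is None:
            return False
    return node.final

def count_states(root):
    seen, stack = set(), [root]
    while stack:
        state = stack.pop()
        if id(state) not in seen:
            seen.add(id(state))
            stack.extend(state.arcs.values())
    return len(seen)

dafsa = build_minimal_dafsa(["tap", "taps", "top", "tops"])
print(count_states(dafsa))
```

For these four words the automaton has 5 states, where a plain trie would need 8. The register here is a Python dict keyed by state signatures.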


Future Work

• it is a work in progress, even the measured times are biased
• we want to
  • fine tune hat-trie (we have used default settings)
  • further reduce
    • compile space: fsa can be built directly in memory
    • compile time: hash for "registered" nodes
    • run space: VLEncoded information, relative addresses, UTF-8, ...
    • run time: smaller run space, numbers in arcs
  • run experiments on a hdd not shared with other processes


Fast Construction of a Word↔Number Index for Large Data - RASLAN 2013
