Sigma Encoded Inverted Files

Andrew Trotman, University of Otago, New Zealand [email protected]

Vikram Subramanya, National Institute of Technology Karnataka, Surathkal, India [email protected]

Presented at ACM Conference on Information and Knowledge Management (CIKM) 2007 at Lisbon, Portugal from November 6-9, 2007.

Abstract
• Compression of term frequency and very long doc-id lists for inverted files is examined. Traditional methods are not well suited to compressing these lists.
• A novel technique is presented: Sigma Encoding prior to compression. In essence, a parameterized dictionary is used to reduce the number of bits per integer.
• This method shows an improvement of about 0.3 bits per integer, at a cost of about 3 clock cycles per integer to decompress.

Motivation
• Term frequencies, unlike doc-ids, do not form a monotonic sequence.
• As document frequency increases, we expect the average tf to increase. Grammatical conjunctions (but, when, etc.), for example, are expected to occur in a large number of documents and to occur many times.
• Consequently, as document frequency increases, the doc-ids are expected to compress more effectively, but the term frequencies less so.

Term Frequency Compression

• The figure shows the number of bits per integer needed to store the term frequencies for the TREC Wall Street Journal (WSJ) collection using Carryover-12 compression. The mean is shown as a dashed line.

Sigma Encoding Technique
• The compression process for a list of term frequencies proceeds as follows:
  ◦ Construct a frequency-ordered dictionary (from most to least frequent).
  ◦ Within each group of dictionary positions of the same base-2 magnitude (positions 0-1, 2-3, 4-7, etc.), sort the entries into increasing order.
  ◦ Apply delta encoding to these position ranges.
  ◦ Renumber the original list with the dictionary positions (sigmas).
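The four steps above can be sketched in Python as follows. This is an illustrative reading of the poster, not the authors' implementation: the function names are hypothetical, ties in frequency are broken by value, and the delta encoding is assumed to restart from zero at the beginning of each position range.

```python
from collections import Counter


def magnitude_ranges(n):
    """Yield (start, stop) position ranges 0-1, 2-3, 4-7, ... clipped to n."""
    start, stop = 0, 2
    while start < n:
        yield start, min(stop, n)
        start, stop = stop, stop * 2


def build_dictionary(values):
    """Steps 1 and 2: frequency-ordered dictionary, sorted within each range."""
    counts = Counter(values)
    # Most frequent value first; ties broken by value (an assumption).
    dictionary = sorted(counts, key=lambda v: (-counts[v], v))
    for start, stop in magnitude_ranges(len(dictionary)):
        dictionary[start:stop] = sorted(dictionary[start:stop])
    return dictionary


def delta_encode_ranges(dictionary):
    """Step 3: delta-encode each range, restarting from 0 (an assumption)."""
    deltas = []
    for start, stop in magnitude_ranges(len(dictionary)):
        previous = 0
        for value in dictionary[start:stop]:
            deltas.append(value - previous)
            previous = value
    return deltas


def sigma_encode(values, dictionary):
    """Step 4: replace each value with its dictionary position (its sigma)."""
    position = {value: i for i, value in enumerate(dictionary)}
    return [position[value] for value in values]


term_frequencies = [3, 3, 3, 1, 1, 2, 3, 1, 5, 2]
dictionary = build_dictionary(term_frequencies)
print(dictionary)                                  # [1, 3, 2, 5]
print(delta_encode_ranges(dictionary))             # [1, 2, 2, 3]
print(sigma_encode(term_frequencies, dictionary))  # [1, 1, 1, 0, 0, 2, 1, 0, 3, 2]
```

Sorting within a range does not change how many bits a position needs, so the most frequent values keep the shortest sigmas while the dictionary itself becomes a run of small positive gaps.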

Illustration of Working

Sigma Encoding (Contd.)
• The final sequence is composed of the dictionary length (|D|), the delta-encoded dictionary, and the sigma-encoded term frequency values.
• Additionally, the scheme is parameterized. Only values occurring more than a threshold T times are added to the dictionary; the others are stored as |D| + v, where |D| is the dictionary length and v is the value from the list (T=1 here).
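A minimal round-trip sketch of this layout and of the threshold escape is given below. The names `encode`, `decode`, and `ranges` are hypothetical, ties in frequency are broken by value, and each position range is assumed to be delta-encoded from zero; the resulting integer sequence is what would then be handed to a compressor such as Carryover-12.

```python
from collections import Counter


def ranges(n):
    """Yield (start, stop) position ranges 0-1, 2-3, 4-7, ... clipped to n."""
    start, stop = 0, 2
    while start < n:
        yield start, min(stop, n)
        start, stop = stop, stop * 2


def encode(values, threshold=1):
    """Return [|D|] + delta-encoded dictionary + sigma-encoded values."""
    counts = Counter(values)
    # Only values occurring more than `threshold` times enter the dictionary.
    kept = [v for v in counts if counts[v] > threshold]
    dictionary = sorted(kept, key=lambda v: (-counts[v], v))
    deltas = []
    for start, stop in ranges(len(dictionary)):
        dictionary[start:stop] = sorted(dictionary[start:stop])
        previous = 0  # assumption: deltas restart in every range
        for value in dictionary[start:stop]:
            deltas.append(value - previous)
            previous = value
    position = {value: i for i, value in enumerate(dictionary)}
    size = len(dictionary)
    # Values outside the dictionary are escaped as |D| + v.
    sigmas = [position[v] if v in position else size + v for v in values]
    return [size] + deltas + sigmas


def decode(sequence):
    """Invert encode(): rebuild the dictionary, then map sigmas back."""
    size = sequence[0]
    deltas = sequence[1:1 + size]
    sigmas = sequence[1 + size:]
    dictionary = []
    for start, stop in ranges(size):
        previous = 0
        for delta in deltas[start:stop]:
            previous += delta
            dictionary.append(previous)
    # A sigma below |D| is a dictionary position; anything else is |D| + v.
    return [dictionary[s] if s < size else s - size for s in sigmas]


term_frequencies = [3, 3, 3, 1, 1, 2, 3, 1, 5, 2]
encoded = encode(term_frequencies, threshold=1)
print(encoded)  # [3, 1, 2, 2, 1, 1, 1, 0, 0, 2, 1, 0, 8, 2]
assert decode(encoded) == term_frequencies
```

In this example the value 5 occurs only once, so with T=1 it stays out of the dictionary and is written as |D| + v = 3 + 5 = 8. The decoder can tell the two cases apart because every dictionary position is smaller than |D|.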

Compress & Decompress Efficiency

                          Doc-id            Term Freq
Method                  BPI     CPI       BPI     CPI
Elias-δ                 8.7   113.8       4.5    65.1
Elias-γ                 8.5   102.5       3.4    44.7
Golomb                  6.2    99.5       2.7    42.6
Variable Byte           9.4    11.5       8.0     9.0
Simple-9                7.6    15.0       3.1    13.2
Relative-10             7.2    16.9       3.1    13.9
Carryover-12            7.0    16.7       3.1    14.4
Sigma Carryover-12      7.3    20.7       2.8    17.0

Effectiveness For Document-id Lists

Results
• Word-aligned codes are fast and efficient:
  ◦ We reproduce prior results that Carryover-12 is nearly as efficient as Golomb and nearly as fast to decompress as Variable Byte Encoding.
• Sigma encoding is effective for term-frequency lists:
  ◦ Table 1 shows an improvement of 0.3 bits per integer (BPI) on WSJ. Wt10g shows a similar improvement.

Results (Contd.)
• Long doc-id lists compress well with sigma encoding:
  ◦ Although sigma encoding is not, in general, effective for doc-id lists, Figure 3 shows that it is effective when the lists are long.
• Sigma encoding is fast to decompress:
  ◦ The additional decompression cost is 3 clock cycles per integer (CPI) for WSJ. A similar result is seen for Wt10g.
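As a quick check, both headline figures can be read off the term-frequency columns of the Carryover-12 and Sigma Carryover-12 rows in the efficiency table, assuming those rows report the WSJ measurements:

```latex
\begin{align*}
\Delta\mathrm{BPI}_{\mathrm{tf}} &= 3.1 - 2.8 = 0.3 \\
\Delta\mathrm{CPI}_{\mathrm{tf}} &= 17.0 - 14.4 = 2.6 \approx 3
\end{align*}
```

The doc-id columns of the same rows go the other way on space (7.0 versus 7.3 BPI), which matches the statement that sigma encoding is not, in general, effective for doc-id lists.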


Conclusion
Sigma Encoding followed by Carryover-12 compression is an effective method of storing term frequency lists and long document-id lists in an inverted file search engine.
