Sigma Encoded Inverted Files
Andrew Trotman, University of Otago, New Zealand
[email protected]
Vikram Subramanya, National Institute of Technology Karnataka, Surathkal, India
[email protected]
Presented at the ACM Conference on Information and Knowledge Management (CIKM) 2007, Lisbon, Portugal, November 6-9, 2007.
Abstract
Compression of term frequency lists and very long doc-id lists for inverted files is examined. Traditional methods are not well suited to compressing these lists. A novel technique is presented: Sigma Encoding prior to compression. In essence, a parameterized dictionary is used to reduce the number of bits per integer. The method improves compression by about 0.3 bits per integer at a cost of about 3 additional clock cycles per integer to decompress.
Motivation
Term frequencies, unlike doc-ids, do not form a monotonic sequence. As document frequency increases, we expect the average term frequency (tf) to increase. The grammatical conjunctions (but, when, etc.), for example, are expected to occur in a large number of documents and to occur many times in each. The consequence is that as document frequency increases, the doc-ids are expected to compress more effectively, but the term frequencies less so.
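The contrast between the two kinds of list can be sketched in a few lines of Python. This is only an illustration with made-up numbers: the helper name `gaps` and both sample lists are assumptions, not data from the collections used here. Doc-ids increase, so the differences between neighbours are positive and shrink as a term appears in more documents; term frequencies have no such order, so the same differencing yields negative values.

```python
def gaps(values):
    """Differences between consecutive entries (hypothetical helper).

    The first entry is measured against zero.
    """
    return [b - a for a, b in zip([0] + values[:-1], values)]


# Made-up postings for one term: doc-ids are strictly increasing.
doc_ids = [3, 7, 8, 15]
# Made-up term frequencies for the same postings: no ordering.
term_freqs = [2, 1, 4, 1]

print(gaps(doc_ids))     # every gap is positive
print(gaps(term_freqs))  # some differences are negative
```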
Term Frequency Compression
The figure shows the number of bits per integer needed to store the term frequencies for the TREC Wall Street Journal (WSJ) collection using Carryover-12 compression. The mean is shown as a dashed line.
Sigma Encoding Technique
The compression process for a list of term frequencies proceeds as follows:
1. Construct a frequency-ordered dictionary (from most to least frequent).
2. Sort into increasing order the dictionary entries occupying positions of the same base-2 magnitude (positions 0-1, 2-3, 4-7, etc.).
3. Apply delta encoding to these position ranges.
4. Renumber the original list with the dictionary positions (sigmas).
Illustration of Working
Sigma Encoding (Contd.)
The final sequence is composed of the dictionary length (|D|), the delta-encoded dictionary, and the sigma-encoded term frequency values. Additionally, the scheme is parameterized: only values occurring more than a threshold T times are added to the dictionary; the others are stored as |D| + v, where |D| is the dictionary length and v is the value from the list (T=1 here).
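The encoding steps and the layout of the final sequence can be sketched in Python as below. This is a minimal sketch under stated assumptions, not the authors' implementation. The function names (`position_ranges`, `sigma_encode`, `sigma_decode`) are hypothetical. Ties in frequency are broken by value, delta encoding restarts in each position range with the first entry of the range stored as-is, and the output is a plain list of integers that would still have to be passed to a compressor such as Carryover-12, which is not implemented here.

```python
from collections import Counter


def position_ranges(n):
    """Yield (start, end) for positions 0-1, 2-3, 4-7, ... clipped to n."""
    lo, hi = 0, 2
    while lo < n:
        yield lo, min(hi, n)
        lo, hi = hi, hi * 2


def sigma_encode(values, threshold=1):
    """Return [|D|] + delta-encoded dictionary + sigma-encoded values."""
    counts = Counter(values)
    # Step 1: keep values seen more than `threshold` times, ordered from
    # most to least frequent (tie-break by value is an assumption).
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    dictionary = [v for v, c in ordered if c > threshold]
    # Step 2: sort the entries inside each position range.
    for lo, hi in position_ranges(len(dictionary)):
        dictionary[lo:hi] = sorted(dictionary[lo:hi])
    # Step 3: delta-encode inside each range (restart per range: assumption).
    deltas = []
    for lo, hi in position_ranges(len(dictionary)):
        prev = 0
        for v in dictionary[lo:hi]:
            deltas.append(v - prev)
            prev = v
    # Step 4: replace each value by its dictionary position (its sigma);
    # values outside the dictionary are stored as |D| + v.
    size = len(dictionary)
    position = {v: i for i, v in enumerate(dictionary)}
    sigmas = [position[v] if v in position else size + v for v in values]
    return [size] + deltas + sigmas


def sigma_decode(sequence):
    """Invert sigma_encode: rebuild the dictionary, then map sigmas back."""
    size = sequence[0]
    deltas = sequence[1:1 + size]
    dictionary = []
    for lo, hi in position_ranges(size):
        prev = 0
        for d in deltas[lo:hi]:
            prev += d
            dictionary.append(prev)
    return [dictionary[s] if s < size else s - size
            for s in sequence[1 + size:]]


# Made-up term frequency list for illustration.
tfs = [2, 1, 2, 3, 2, 1, 2, 5, 3, 2]
encoded = sigma_encode(tfs, threshold=1)
print(encoded)
print(sigma_decode(encoded) == tfs)
```

Decoding needs only a bounds check and a table lookup per integer, which is consistent with the small decompression overhead reported for the method.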
Compress & Decompress Efficiency
Table 1: Bits per integer (BPI) and clock cycles per integer (CPI) to decompress, by method.

                      Doc-id            Term Freq
Method                BPI      CPI      BPI      CPI
Elias-δ               8.7      113.8    4.5      65.1
Elias-γ               8.5      102.5    3.4      44.7
Golomb                6.2      99.5     2.7      42.6
Variable Byte         9.4      11.5     8.0      9.0
Simple-9              7.6      15.0     3.1      13.2
Relative-10           7.2      16.9     3.1      13.9
Carryover-12          7.0      16.7     3.1      14.4
Sigma Carryover-12    7.3      20.7     2.8      17.0
Effectiveness For Document-id Lists
Results
Word-aligned codes are fast and efficient:
We reproduce prior results that Carryover-12 is nearly as efficient as Golomb and nearly as fast to decompress as Variable Byte Encoding.
Sigma encoding is effective for term-frequency lists:
Table 1 shows an improvement of 0.3 bits per integer (BPI) on WSJ term frequencies: 2.8 with sigma encoding versus 3.1 for Carryover-12 alone. Wt10g shows a similar improvement.
Results (Contd.)
Long doc-id lists compress well with sigma encoding:
Although sigma encoding is not, in general, effective for doc-id lists, Figure 3 shows that it is effective when the lists are long.
Sigma encoding is fast to decompress:
The additional decompression cost is about 3 clock cycles per integer (CPI) for WSJ. A similar result is seen for Wt10g.
Conclusion
Sigma Encoding followed by Carryover-12 compression is an effective method of storing term frequency lists and long doc-id lists in an inverted file search engine.