|
ABSTRACT
Run-Length-Encoding (RLE) is a data compression technique that is used in various applications, e.g., time series, biological sequences, and multimedia databases. One of the main challenges is how to operate on (e.g., index, search, and retrieve) compressed data without decompressing it. In this paper, we introduce the <u>S</u>tring <u>B</u>-tree for <u>C</u>ompressed sequences, termed the SBC-tree, for indexing and searching RLE-compressed sequences of arbitrary length. The SBC-tree is a two-level index structure based on the well-known String B-tree and a 3-sided range query structure [7]. The SBC-tree supports pattern matching queries such as substring matching, prefix matching, and range search operations over RLE-compressed sequences. The SBC-tree has an optimal external-memory space complexity of O(N/B) pages, where N is the total length of the compressed sequences, and B is the disk page size. Substring matching, prefix matching, and range search execute in an optimal O(logB N + |p|+T/B) I/O operations, where |p| is the length of the compressed query pattern and T is the query output size. The SBC-tree is also dynamic and supports insert and delete operations efficiently. The insertion and deletion of all suffixes of a compressed sequence of length m take O(m logB(N + m)) amortized I/O operations. The SBC-tree index is realized inside PostgreSQL. Performance results illustrate that using the SBC-tree to index RLE-compressed sequences achieves up to an order of magnitude reduction in storage, while retains the optimal search performance achieved by the String B-tree over the uncompressed sequences.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
Amihood Amir , Gary Benson , Martin Farach, Let sleeping files lie: pattern matching in Z-compressed files, Proceedings of the fifth annual ACM-SIAM symposium on Discrete algorithms, p.705-714, January 23-25, 1994, Arlington, Virginia, United States
|
| |
2
|
|
| |
3
|
Amihood Amir , Gad M. Landau , Dina Sokol, Inplace run-length 2d compressed search, Proceedings of the eleventh annual ACM-SIAM symposium on Discrete algorithms, p.817-818, January 09-11, 2000, San Francisco, California, United States
|
| |
4
|
|
| |
5
|
|
| |
6
|
|
 |
7
|
Lars Arge , Vasilis Samoladas , Jeffrey Scott Vitter, On two-dimensional indexability and optimal range search indexing, Proceedings of the eighteenth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems, p.346-357, May 31-June 03, 1999, Philadelphia, Pennsylvania, United States
[doi> 10.1145/303976.304010]
|
| |
8
|
R. Bayer and E. M. McCreight. Organization and maintenance of large ordered indices. Acta Informatica, 1:173--189, 1972.
|
 |
9
|
|
| |
10
|
|
| |
11
|
|
| |
12
|
|
 |
13
|
|
| |
14
|
M. Burrows and D. J. Wheeler. A block-sorting lossless data compression algorithm. Technical Report 124, 1994.
|
| |
15
|
Y.-F. Chien, W.-K. Hon, R. Shah, and J. S. Vitter. Compressed text indexing and range searching. Technical Report Purdue University tech. report, CSD TR06-021, DEC 2006.
|
 |
16
|
|
 |
17
|
|
 |
18
|
|
| |
19
|
|
| |
20
|
|
 |
21
|
|
| |
22
|
|
| |
23
|
S. W. Golomb. Run-length encodings. Trans. on Information Theory, 12:399--401, 1966.
|
| |
24
|
|
| |
25
|
|
| |
26
|
|
| |
27
|
|
| |
28
|
|
| |
29
|
V. Makinen and G. Navarro. Dynamic entropy-compressed sequences and full-text indexes. In CMP, pages 306--317, 2006.
|
| |
30
|
|
| |
31
|
|
 |
32
|
|
| |
33
|
E. M. McCreight. Priority search trees. SIAM Journal, 14(2):257--276, 1985.
|
| |
34
|
A. Moffat. Implementing the ppm data compression scheme. Trans. on Communications, 38(11):1917--1921, 1990.
|
 |
35
|
|
| |
36
|
|
| |
37
|
|
| |
38
|
N. S. Prywes and H. J. Gray. The organization of a multilist-type associative memory. In Transactions on Communication and Electronics, pages 488--492, 1963.
|
| |
39
|
|
| |
40
|
|
| |
41
|
Mike Stonebraker , Daniel J. Abadi , Adam Batkin , Xuedong Chen , Mitch Cherniack , Miguel Ferreira , Edmond Lau , Amerson Lin , Sam Madden , Elizabeth O'Neil , Pat O'Neil , Alex Rasin , Nga Tran , Stan Zdonik, C-store: a column-oriented DBMS, Proceedings of the 31st international conference on Very large data bases, August 30-September 02, 2005, Trondheim, Norway
|
| |
42
|
H. Tanaka and A. L. Garcia. Efficient run-length encodings. Trans. on Information Theory, 28(6):880--889, 1982.
|
| |
43
|
|
| |
44
|
|
| |
45
|
|
 |
46
|
|
| |
47
|
P. Weiner. Linear pattern matching algorithms. In Symposium on Switching and Automata Theory, pages 1--11, 1973.
|
| |
48
|
J. Ziv and A. Lempel. A universal algorithm for sequential data compression. Trans. on Information Theory, 23(3):337--343, 1977.
|
| |
49
|
J. Ziv and A. Lempel. Compression of individual sequences via variable-rate coding. Trans. on Information Theory, 24(5):530--536, 1978.
|
|