|
ABSTRACT
The subject of this article is differential compression, the algorithmic task of finding common strings between versions of data and using them to encode one version compactly by describing it as a set of changes from its companion. A main goal of this work is to present new differencing algorithms that (i) operate at a fine granularity (the atomic unit of change), (ii) make no assumptions about the format or alignment of input data, and (iii) in practice use linear time, use constant space, and give good compression. We present new algorithms, which do not always compress optimally but use considerably less time or space than existing algorithms. One new algorithm runs in O(n) time and O(1) space in the worst case (where each unit of space contains [log n] bits), as compared to algorithms that run in O(n) time and O(n) space or in O(n2) time and O(1) space. We introduce two new techniques for differential compression and apply these to give additional algorithms that improve compression and time performance. We experimentally explore the properties of our algorithms by running them on actual versioned data. Finally, we present theoretical results that limit the compression power of differencing algorithms that are restricted to making only a single pass over the data.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
|
| |
2
|
Banga, G., Douglis, F., and Rabinovich, M. 1997. Optimistic deltas for WWW latency reduction. In Proceedings of the 1997 USENIX Annual Technical Conference. USENIX Association, Berkeley, Calif., 289--303.
|
 |
3
|
|
 |
4
|
|
| |
5
|
Chan, M., and Woo, T. 1999. Cache-based compaction: A new technique for optimizing web transfer. In Proceedings of the IEEE Infocom '99 Conference. IEEE Computer Society Press, Los Alamitos, Calif.
|
 |
6
|
|
| |
7
|
de Jong, S. P. 1972. Combining of changes to a source file. IBM Tech. Discl. Bull. 15, 4 (Sept.), 1186--1188.
|
| |
8
|
|
 |
9
|
|
| |
10
|
|
| |
11
|
Hardy, G. H., Littelwood, J. E., and Pólya, G. 1964. Inequalities. Cambridge University Press, London.
|
 |
12
|
|
| |
13
|
|
| |
14
|
|
| |
15
|
Korn, D. G., and Vo, K.-P. 1999. The VCDIFF generic differencing and compression format. Tech. Rep. Internet-Draft draft-vo-vcdiff-00.
|
| |
16
|
|
| |
17
|
MacDonald, J. P. 2000. File system support for delta compression. Masters thesis. Department of Electrical Engineering and Computer Science, University of California at Berkeley, Berkeley, Calif.
|
| |
18
|
Miller, W., and Myers, E. W. 1985. A file comparison program. Softw. Pract. Exper. 15, 11 (Nov.), 1025--1040.
|
 |
19
|
Jeffrey C. Mogul , Fred Douglis , Anja Feldmann , Balachander Krishnamurthy, Potential benefits of delta encoding and data compression for HTTP, Proceedings of the ACM SIGCOMM '97 conference on Applications, technologies, architectures, and protocols for computer communication, p.181-194, September 14-18, 1997, Cannes, France
|
 |
20
|
|
| |
21
|
Rochkind, M. J. 1975. The source code control system. IEEE Trans. Softw. Eng. SE-1, 4 (Dec.), 364--370.
|
 |
22
|
|
| |
23
|
|
| |
24
|
Tudor, P. N. 1995. MPEG-2 video compression. Elect. Commun. Eng. J. 7, 6 (Dec.), 257--264.
|
 |
25
|
|
| |
26
|
Weiner, P. 1973. Linear pattern matching algorithms. In Proceedings of the 14th Annual IEEE Symposium on Switching and Automata Theory. IEEE Computer Society Press, Los Alamitos, Calif., 1--11.
|
| |
27
|
Yao, A. C. 1977. Probabilistic computation: towards a unified measure of complexity. In Proceedings of the 18th Annual IEEE Symposium on Foundations of Computer Science. IEEE Computer Society Press, Los Alamitos, Calif., 222--227.
|
| |
28
|
|
| |
29
|
Ziv, J., and Lempel, A. 1977. A universal algorithm for sequential data compression. IEEE Trans. Inf. Theory 23, 3 (May), 337--343.
|
| |
30
|
Ziv, J., and Lempel, A. 1978. Compression of individual sequences via variable-rate coding. IEEE Trans. Inf. Theory 24, 5 (Sept.), 530--536.
|
CITED BY 13
|
|
|
|
|
Toshiro Takase , Hisashi MIYASHITA , Toyotaro Suzumura , Michiaki Tatsubori, An adaptive, fast, and safe XML parser based on byte sequences memorization, Proceedings of the 14th international conference on World Wide Web, May 10-14, 2005, Chiba, Japan
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
Purushottam Kulkarni , Fred Douglis , Jason LaVoie , John M. Tracey, Redundancy elimination within large collections of files, Proceedings of the USENIX Annual Technical Conference 2004 on USENIX Annual Technical Conference, p.5-5, June 27-July 02, 2004, Boston, MA
|
|
|
|
|
|
|
|
|
|
|
|
Mark Lillibridge , Kave Eshghi , Deepavali Bhagwat , Vinay Deolalikar , Greg Trezise , Peter Camble, Sparse indexing: large scale, inline deduplication using sampling and locality, Proccedings of the 7th conference on File and stroage technologies, p.111-123, February 24-27, 2009, San Francisco, California
|
|
|
Chuanyi Liu , Yu Gu , Linchun Sun , Bin Yan , Dongsheng Wang, R-ADMAD: high reliability provision for large-scale de-duplication archival storage systems, Proceedings of the 23rd international conference on Supercomputing, June 08-12, 2009, Yorktown Heights, NY, USA
|
|