|
ABSTRACT
We describe some of the design choices that were made during the development of a fast, scalable, inline, deduplication device. The system's design goals and how they were achieved are presented. This is the firs deduplication device that uses similarity matching. The paper provides the following original research contributions: we show how similarity signatures can serve in a deduplication scheme; a novel type of similarity signatures is presented and its advantages in the context of deduplication requirements are explained. It is also shown how to combine similarity matching schemes with byte by byte comparison or hash based identity schemes.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
| |
1
|
|
 |
2
|
|
 |
3
|
|
 |
4
|
|
 |
5
|
Sergey Brin , James Davis , Héctor García-Molina, Copy detection mechanisms for digital documents, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.398-409, May 22-25, 1995, San Jose, California, United States
|
| |
6
|
|
| |
7
|
A. Z. Broder. Some applications of Rabin's fingerprinting method. In R. Capocelli, A. De Santis, and U. Vaccaro, editors, Sequences II: Methods in Communications, Security, and Computer Science, 143--152. Springer-Verlag, 1993.
|
| |
8
|
|
 |
9
|
Andrei Z. Broder , Moses Charikar , Alan M. Frieze , Michael Mitzenmacher, Min-wise independent permutations (extended abstract), Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.327-336, May 24-26, 1998, Dallas, Texas, United States
[doi> 10.1145/276698.276781]
|
| |
10
|
|
| |
11
|
Andrei Z. Broder , Steven C. Glassman , Mark S. Manasse , Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth international conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
|
| |
12
|
Fischer M. J., Paterson M. S., String matching and other products, in Complexity of Computation, R. M. Karp (editor), SIAM-AMS Proc. 7 (1974) 113--125.
|
| |
13
|
B. Garret and C. Bouffard, The Enterprise Strategy Group (ESG) lab validation report for IBM TS7650G ProtecTier, 2008. Available at www.diligent.com.
|
| |
14
|
N. Heintze. Scalable Document Fingerprinting. Proceedings of the Second USENIX Workshop on Electronic Commerce, pages 191--200, 1996.
|
| |
15
|
|
| |
16
|
Mark Lillibridge , Kave Eshghi , Deepavali Bhagwat , Vinay Deolalikar , Greg Trezise , Peter Camble, Sparse indexing: large scale, inline deduplication using sampling and locality, Proccedings of the 7th conference on File and storage technologies, p.111-123, February 24-27, 2009, San Francisco, California
|
| |
17
|
|
| |
18
|
Knuth D. E., Morris J. H., Pratt V. R., Fast pattern matching in strings, SIAM Journal on Computing 6 (1977) 323--350.
|
| |
19
|
Purushottam Kulkarni , Fred Douglis , Jason LaVoie , John M. Tracey, Redundancy elimination within large collections of files, Proceedings of the annual conference on USENIX Annual Technical Conference, p.5-5, June 27-July 02, 2004, Boston, MA
|
| |
20
|
|
| |
21
|
|
| |
22
|
Moulton G. H., Whitehill S. B., Hash fil system and method for use in a commonality factoring system, U.S. Pat. No. 6,704,730.
|
 |
23
|
|
| |
24
|
|
 |
25
|
|
| |
26
|
|
| |
27
|
Ukkonen E., On-line construction of suffi trees, Algorithmica 14(3) (1995) 249--260.
|
| |
28
|
|
| |
29
|
|
| |
30
|
J. Ziv and A. Lempel. A universal algorithm for sequential data compression, IEEE Trans. Inform. Theory, vol. IT-23, pp. 337--343, May 1977. 282
|
|