ACM Home Page
Please provide us with feedback. Feedback
Finding similar files in large document repositories
Full text PdfPdf (833 KB)
Source International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining table of contents
Chicago, Illinois, USA
SESSION: Industry/government track paper table of contents
Pages: 394 - 400  
Year of Publication: 2005
ISBN:1-59593-135-X
Authors
George Forman  Hewlett-Packard Labs, Palo Alto, CA
Kave Eshghi  Hewlett-Packard Labs, Palo Alto, CA
Stephane Chiocchetti  Hewlett-Packard France, Courtaboeuf, France
Sponsors
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
ACM: Association for Computing Machinery
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 19,   Downloads (12 Months): 123,   Citation Count: 6
Additional Information:

abstract   references   cited by   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1081870.1081916
What is a DOI?

ABSTRACT

Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction.The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing.We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
 
3
K. Eshghi and H.K. Tang . A Framework for Analyzing and Improving Content-Based Chunking Algorithms. Hewlett-Packard Labs Technical Report TR 2005-30.
 
4
 
5
V. Henson and R. Henderson. Guidelines for Using Compare-by-Hash. Forthcoming, 2005. http://infohost.nmt.edu/~val/review/hash2.html
 
6
U. Manber. Finding similar files in a large file system. In Proceedings of the Winter 1994 USENIX Technical Conference, San Francisco, CA, January 1994.
7
 
8
M.O. Rabin. Fingerprinting by Random Polynomials. Tech. Rep. TR-15-81, Center for Research in Computing Technology, Harvard Univ., Cambridge, Mass., 1981.

CITED BY  6

Collaborative Colleagues:
George Forman: colleagues
Kave Eshghi: colleagues
Stephane Chiocchetti: colleagues