Finding similar files in large document repositories
Source
International Conference on Knowledge Discovery and Data Mining
Proceedings of the eleventh ACM SIGKDD international conference on Knowledge discovery in data mining
Chicago, Illinois, USA
SESSION: Industry/government track paper
Pages: 394 - 400
Year of Publication: 2005
ISBN: 1-59593-135-X
ABSTRACT
Hewlett-Packard has many millions of technical support documents in a variety of collections. As part of content management, such collections are periodically merged and groomed. In the process, it becomes important to identify and weed out support documents that are largely duplicates of newer versions. Doing so improves the quality of the collection, eliminates chaff from search results, and improves customer satisfaction.

The technical challenge is that through workflow and human processes, the knowledge of which documents are related is often lost. We required a method that could identify similar documents based on their content alone, without relying on metadata, which may be corrupt or missing.

We present an approach for finding similar files that scales up to large document repositories. It is based on chunking the byte stream to find unique signatures that may be shared in multiple files. An analysis of the file-chunk graph yields clusters of related files. An optional bipartite graph partitioning algorithm can be applied to greatly increase scalability.
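
The chunk-and-cluster idea in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the abstract does not specify the chunking algorithm, the signature function, or the graph analysis. The polynomial rolling hash, the SHA-256 signatures, the connected-components clustering, and every name and parameter below (`chunks`, `signatures`, `cluster`, `window`, `mask`, `min_size`) are assumptions chosen for illustration.

```python
import hashlib
from collections import defaultdict

BASE = 257            # polynomial base for the rolling hash (assumed value)
MOD = (1 << 31) - 1   # modulus for the rolling hash (assumed value)


def chunks(data: bytes, window: int = 16, mask: int = 0x3F, min_size: int = 32) -> list:
    """Split data at content-defined boundaries (hypothetical parameters).

    A cut is placed after byte i when the hash of the last `window` bytes
    has all `mask` bits set and the current chunk has at least `min_size` bytes.
    """
    out, start, h = [], 0, 0
    top = pow(BASE, window - 1, MOD)  # weight of the byte that leaves the window
    for i, b in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % MOD  # drop the oldest byte
        h = (h * BASE + b) % MOD                    # shift in the new byte
        if i + 1 >= window and i + 1 - start >= min_size and (h & mask) == mask:
            out.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        out.append(data[start:])  # trailing bytes form the last chunk
    return out


def signatures(data: bytes) -> set:
    """Return one hex digest per distinct chunk of the byte stream."""
    return {hashlib.sha256(c).hexdigest() for c in chunks(data)}


def cluster(files: dict) -> list:
    """Group file names that share at least one chunk signature.

    Builds the bipartite file-chunk graph, then merges its connected
    components with union-find (an assumed, simple form of graph analysis).
    """
    owners = defaultdict(set)  # chunk signature -> names of files containing it
    for name, data in files.items():
        for sig in signatures(data):
            owners[sig].add(name)

    parent = {name: name for name in files}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for names in owners.values():
        first, *rest = sorted(names)
        for other in rest:
            parent[find(other)] = find(first)

    groups = defaultdict(set)
    for name in files:
        groups[find(name)].add(name)
    return sorted((sorted(g) for g in groups.values()), key=lambda g: g[0])
```

Calling `cluster({"a.txt": data_a, "b.txt": data_b})` returns a list of groups of file names. Because cut points depend only on nearby bytes, an insertion near the start of a file tends to change only the first few chunks, so later chunks can still match an older version. Linking files on a single shared chunk is deliberately permissive; a practical system would likely require a minimum amount of shared content before treating two files as related.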
CITED BY 6
Ludmila Cherkasova, Kave Eshghi, Charles B. Morrey, Joseph Tucek, Alistair Veitch, Applying syntactic similarity algorithms for enterprise information management, Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining, June 28-July 01, 2009, Paris, France
John C. Tang, Clemens Drews, Mark Smith, Fei Wu, Alison Sue, Tessa Lau, Exploring patterns of social commonality among file directories at work, Proceedings of the SIGCHI conference on Human factors in computing systems, April 28-May 03, 2007, San Jose, California, USA
Mark Lillibridge, Kave Eshghi, Deepavali Bhagwat, Vinay Deolalikar, Greg Trezise, Peter Camble, Sparse indexing: large scale, inline deduplication using sampling and locality, Proceedings of the 7th conference on File and storage technologies, p.111-123, February 24-27, 2009, San Francisco, California