ACM Home Page
Please provide us with feedback. Feedback
Applying syntactic similarity algorithms for enterprise information management
Full text MovMov (24:27),  PdfPdf (620 KB)
Source
International Conference on Knowledge Discovery and Data Mining archive
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining table of contents
Paris, France
SESSION: Industrial track papers table of contents
Pages 1087-1096  
Year of Publication: 2009
ISBN:978-1-60558-495-9
Authors
Ludmila Cherkasova  Hewlett-Packard Labs, Palo Alto, CA, USA
Kave Eshghi  Hewlett-Packard Labs, Palo Alto, CA, USA
Charles B. Morrey  Hewlett-Packard Labs, Palo Alto, CA, USA
Joseph Tucek  Hewlett-Packard Labs, Palo Alto, CA, USA
Alistair Veitch  Hewlett-Packard Labs, Palo Alto, CA, USA
Sponsors
ACM: Association for Computing Machinery
SIGKDD: ACM Special Interest Group on Knowledge Discovery in Data
SIGMOD: ACM Special Interest Group on Management of Data
Publisher
ACM  New York, NY, USA
Bibliometrics
Downloads (6 Weeks): 23,   Downloads (12 Months): 98,   Citation Count: 0
Additional Information:

abstract   references   index terms   collaborative colleagues  

Tools and Actions: Request Permissions Request Permissions    Review this Article  
DOI Bookmark: Use this link to bookmark this Article: http://doi.acm.org/10.1145/1557019.1557137
What is a DOI?

ABSTRACT

For implementing content management solutions and enabling new applications associated with data retention, regulatory compliance, and litigation issues, enterprises need to develop advanced analytics to uncover relationships among the documents, e.g., content similarity, provenance, and clustering. In this paper, we evaluate the performance of four syntactic similarity algorithms. Three algorithms are based on Broder's "shingling" technique while the fourth algorithm employs a more recent approach, "content-based chunking". For our experiments, we use a specially designed corpus of documents that includes a set of "similar" documents with a controlled number of modifications. Our performance study reveals that the similarity metric of all four algorithms is highly sensitive to settings of the algorithms' parameters: sliding window size and fingerprint sampling frequency. We identify a useful range of these parameters for achieving good practical results, and compare the performance of the four algorithms in a controlled environment. We validate our results by applying these algorithms to finding near-duplicates in two large collections of HP technical support documents.


REFERENCES

Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.

1
 
2
3
 
4
5
6
 
7
D. Fetterly, M. Manasse, M. Najork. On the Evolution of Clusters of Near-Duplicate Web Pages. Journal of Web Engineering, vol.2, N.4, Oct, 2004.
8
9
 
10
N. Heintze. Scalable document fingerprinting. Proc. of the Second USENIX Workshop on Electronic Commerce, 1996.
 
11
12
 
13
 
14
B. Jenkins. Hash Functions. "Algorithm Alley", Dr. Dobb's Journal, September, 1997, http://www.ddj.com/184410284
 
15
16
 
17
M.O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
18
 
19
 
20
A. Tridgell and P. Mackerras. The rsync algorithm. Technical Report TR-CS-96-05, Australian National University, 1996.

Collaborative Colleagues:
Ludmila Cherkasova: colleagues
Kave Eshghi: colleagues
Charles B. Morrey: colleagues
Joseph Tucek: colleagues
Alistair Veitch: colleagues