| Applying syntactic similarity algorithms for enterprise information management |
| Full text |
Mov
(24:27),
Pdf
(620 KB)
|
Source
|
International Conference on Knowledge Discovery and Data Mining
archive
Proceedings of the 15th ACM SIGKDD international conference on Knowledge discovery and data mining
table of contents
Paris, France
SESSION: Industrial track papers
table of contents
Pages 1087-1096
Year of Publication: 2009
ISBN:978-1-60558-495-9
|
|
Authors
|
|
Ludmila Cherkasova
|
Hewlett-Packard Labs, Palo Alto, CA, USA
|
|
Kave Eshghi
|
Hewlett-Packard Labs, Palo Alto, CA, USA
|
|
Charles B. Morrey
|
Hewlett-Packard Labs, Palo Alto, CA, USA
|
|
Joseph Tucek
|
Hewlett-Packard Labs, Palo Alto, CA, USA
|
|
Alistair Veitch
|
Hewlett-Packard Labs, Palo Alto, CA, USA
|
|
| Sponsors |
|
| Publisher |
|
| Bibliometrics |
Downloads (6 Weeks): 23, Downloads (12 Months): 98, Citation Count: 0
|
|
|
ABSTRACT
For implementing content management solutions and enabling new applications associated with data retention, regulatory compliance, and litigation issues, enterprises need to develop advanced analytics to uncover relationships among the documents, e.g., content similarity, provenance, and clustering. In this paper, we evaluate the performance of four syntactic similarity algorithms. Three algorithms are based on Broder's "shingling" technique while the fourth algorithm employs a more recent approach, "content-based chunking". For our experiments, we use a specially designed corpus of documents that includes a set of "similar" documents with a controlled number of modifications. Our performance study reveals that the similarity metric of all four algorithms is highly sensitive to settings of the algorithms' parameters: sliding window size and fingerprint sampling frequency. We identify a useful range of these parameters for achieving good practical results, and compare the performance of the four algorithms in a controlled environment. We validate our results by applying these algorithms to finding near-duplicates in two large collections of HP technical support documents.
REFERENCES
Note: OCR errors may be found in this Reference List extracted from the full text article. ACM has opted to expose the complete List rather than only correct and linked references.
 |
1
|
Sergey Brin , James Davis , Héctor García-Molina, Copy detection mechanisms for digital documents, Proceedings of the 1995 ACM SIGMOD international conference on Management of data, p.398-409, May 22-25, 1995, San Jose, California, United States
|
| |
2
|
|
 |
3
|
Andrei Z. Broder , Moses Charikar , Alan M. Frieze , Michael Mitzenmacher, Min-wise independent permutations (extended abstract), Proceedings of the thirtieth annual ACM symposium on Theory of computing, p.327-336, May 24-26, 1998, Dallas, Texas, United States
[doi> 10.1145/276698.276781]
|
| |
4
|
Andrei Z. Broder , Steven C. Glassman , Mark S. Manasse , Geoffrey Zweig, Syntactic clustering of the Web, Selected papers from the sixth international conference on World Wide Web, p.1157-1166, September 1997, Santa Clara, California, United States
|
 |
5
|
|
 |
6
|
|
| |
7
|
D. Fetterly, M. Manasse, M. Najork. On the Evolution of Clusters of Near-Duplicate Web Pages. Journal of Web Engineering, vol.2, N.4, Oct, 2004.
|
 |
8
|
|
 |
9
|
|
| |
10
|
N. Heintze. Scalable document fingerprinting. Proc. of the Second USENIX Workshop on Electronic Commerce, 1996.
|
| |
11
|
|
 |
12
|
|
| |
13
|
|
| |
14
|
B. Jenkins. Hash Functions. "Algorithm Alley", Dr. Dobb's Journal, September, 1997, http://www.ddj.com/184410284
|
| |
15
|
|
 |
16
|
|
| |
17
|
M.O. Rabin. Fingerprinting by random polynomials. Technical Report TR-15-81, Center for Research in Computing Technology, Harvard University, 1981.
|
 |
18
|
|
| |
19
|
|
| |
20
|
A. Tridgell and P. Mackerras. The rsync algorithm. Technical Report TR-CS-96-05, Australian National University, 1996.
|
|